SGI announces Linux Kernel Crash Dumps (LKCD)
Alphix writes "SGI has announced their Linux Kernel Crash Dumps project - and it's gone to release. It's intended to simplify the examination of system crashes thru saving the kernel memory image when the system dies due to a software failure, recovering the kernel memory image when the system is rebooted and then examining the memory image to determine what happened when the failure occurred."
> And also, it's one of the things I really, really, really HATE about NT. No debugger comes with the OS, and there's no free, distributable one out there, so from a tech support standpoint, if your customer's server barfs, you kind of have to guess at what went wrong, or establish a pattern from multiple calls, or try to reproduce it in-house.
Yeah it sucks. The only solution that I'm aware of is to get your customer(s) to install MS Dev Studio, or even NuMega's SoftIce. Not very practical, but it's better than nothing.
Cheers
I think you misunderstood him. Solaris (I assume) has the ability to dump core *for the kernel*. Obviously, not into the filesystem - thus the swap/savecore dance.
And no, it's not only for applications. And it's *very* useful.
My blog: http://www.seebs.net/log/ --- My iPhone/iPad app: http://www.seebs.net/seebsfrac/
Crucify me for saying this, but I have identical results with NT as both a workstation and a server... Your mileage may vary.
--
E2 IN2 IE?
As you note, this is far more than a mere kernel core dumper. I know this site attracts many professional developers and sysadmins, but there are far more who have never had the pleasure of driving IRIX. Linux is really good, given its maturity level, and I use it at work and home (I develop for it at work for our products: see Ariel Corporation ISP products), but IRIX has some real jewels, and SGI has chosen to give the technology to the open source world. The first was XFS (imo the best filesystem ever invented) plus some other assorted stuff that SGI is paying its programming staff to give to us, and now the technology to pinpoint the exact cause of a kernel crash. The IRIX kernel crash postmortem technology is far beyond a mere core dump and pointer: it tries its best to identify the offending system call, and pid if it can. This release appears to be a port of that technology to Linux.
Stop whining folks - we have just been given one of the best debugging tools ( especially for kernel hackers and device driver writers!! ) in existence as a gift. Try using it, and be sure to thank SGI. After all, even though they have market reasons to do this, they still *did* it.
Yes, every reasonable operating system can be configured to save the core files resultant from a kernel panic to swap, and yes, many provide excellent tools for conducting a post-mortem analysis of the image to diagnose what caused it to croak. But in the past, with the notable exception of IRIX, this process required a fairly intimate knowledge of the operating system and even the underlying hardware, and was considered something of a black art. An excellent book on core dump analysis issues/procedures is 'PANIC!' Unix System Crash Dump Analysis, published by Sunsoft. IRIX, and now Linux when properly configured, automatically conducts the crash dump analysis upon re-entering multi-user, saving a legible and comprehensible report detailing what was going on at the time of the crash and providing a suggestion as to the cause.
This facility can be an excellent way of quickly tracking down the cause of the panic, or at least determining if the problem lay in hardware or software. Below are three examples of some recent reports generated at our site:
Sample 1
Sample 2
Sample 3
While this utility is no replacement for an experienced sysadmin and a debugger when it comes to deciphering the cause of failure in complex systems (especially SMP), it will likely be a boon to the hundreds of thousands of Linux admins supporting small workgroup servers and workstations. And yes, Linux is stable.. but c'mon: kernels panic.
What do you mean by "non-standard"?
That guy has no clue, its obvious.
He runs a default RedHat install, with everything enabled and still running the same kernel that came with it, he has no SCSI devices, or a clue to even know what SCSI is, his largest partition is probably a 15GB Windows98SE partition, and he boots linux to winnuke his non-elite irc friends.
Come on, "Stopping md devices..."? Does he actually use md features? I seriously doubt it. It's just that the default RootHat install pushed it down his throat. And apmd? It's kind of pointless on an AC-powered system. And it's really pointless to run RedHat on a laptop, since even the "Laptop" installation still installs updated, which will spin your hdd every 5 seconds and make your battery last less than it does in Win95.
I fucking hate ignorant people that Redhat and similar idiotic distributions bring to the world.
Either
a) You're running kernels with odd numbers (development kernels 2.1.x or 2.3.x)
b) You've got bad/unsupported hardware.
c) Your computer is getting bombarded with abnormally high levels of gamma rays...
Yeah, I have to thank SGI for all the major contributions, but I would really like to know what their business model with linux is.. how are they going to make a profit?
oops. I forgot. Companies don't need to show a profit anymore as long as they have a cool URL and do something with linux or the internet. Too bad sgi can't have another IPO...
Or you're just cool like me. :) I have an unnatural ability to crash machines. Except the Win98 box. That thing's been like a rock.
Like I said: unnatural.
Before I got rid of it, my Linux box crashed regularly. No weird hardware or unusual drivers. The most work I'd give it was KDE.
Unnatural.
Eric
Even though they are moving those pages to Win2k, windbg is most likely backward compatible to at least NT 4. If you do try it, you'll need a correct set of symbols for the machine that generated the dump. With those and the dump, it can be viewed on any machine.
HAHAHAHA. Thanks, I needed a good laugh. You've obviously never worked in the industry. UNIX admins make significantly more than NT admins, because they are way more valuable. One good UNIX admin can run a lot of machines, which can do a lot more than the same number of NT boxes. The NT admins are too busy getting paged out of bed to reboot the server.
You should have considered placing it under a BSD-style license. Microsoft's feelings are going to be hurt over the fact that they can't incorporate it into Windows 2000.
One of the things I remember back from my days when I was tinkering with SunOS 3.4 and Ultrix and IBM's AOS (not AIX) was that many BSDish Unixes would write what were basically kernel "core dumps" to the swap partition when they died (I may be getting details wrong -- might not have been swap file, might not have been all those OSes, etc.). Sophisticated gurus could then fix things. (Back in those days I was not enough of a guru to do this myself, but I lived and worked with people who were.)
I think it's *wonderful* that a facility like this is coming to Linux. It makes me much more enthusiastic about taking on kernel hacking myself.
But out of fairness I do have to ask... don't the BSDoid operating systems already have this?
And it's a little embarrassing to point out that NT has something like this as well.
What happens when the LKCDA crashes during a system crash? Who recovers from that??
Nobody. A crash dumper is going to be a minimal, always-resident program designed to simply copy physical memory to disk. If that can't be done, the system is either fried at the hardware level, or is so far corrupted that a core dump wouldn't mean much anyway.
dragonhawk@iname.microsoft.com
I do not like Microsoft. Remove them from my email address.
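To make the "copy memory to disk, pick it up after the reboot" idea concrete, here is a minimal Python sketch of the two halves. It is not LKCD's actual on-disk format: the `KDUMP001` magic, the header layout, and the function names are all made up for illustration, and a file-like object stands in for the dump partition.

```python
import io
import struct
import zlib

MAGIC = b"KDUMP001"               # hypothetical marker, not LKCD's real header
HEADER = struct.Struct("<8sQI")   # magic, image length, CRC32 of the image


def save_dump(device, memory):
    """Crash path: write a small header plus the raw memory image."""
    device.seek(0)
    device.write(HEADER.pack(MAGIC, len(memory), zlib.crc32(memory)))
    device.write(memory)


def recover_dump(device):
    """Reboot path: return the saved image, or None if no valid dump exists."""
    device.seek(0)
    raw = device.read(HEADER.size)
    if len(raw) < HEADER.size:
        return None
    magic, length, crc = HEADER.unpack(raw)
    if magic != MAGIC:
        return None
    image = device.read(length)
    if len(image) != length or zlib.crc32(image) != crc:
        return None  # truncated or corrupted dump
    return image
```

The checksum is there for the case raised above: if the dumper itself dies halfway through, the recovery side should notice and refuse the image rather than hand a debugger garbage.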
I have been thinking about a solution to kernel panics and no-reboot kernel upgrades for a while, and here is the only thing I have come up with that seems viable:
We have redundant power supplies, hard drives, and many other pieces of hardware. I am thinking it may be good for developers, at any rate, to use redundant kernels. Kernel 1 dies, kernel 2 realises this and kills kernel 1 and takes over the system. Interrupt in service: a few clock cycles. Perhaps a new runlevel should be implemented into the linux kernel...runlevel 7, which would be against the POSIX standard I think, not sure, but would allow a condition in which the kernel is replacing itself in memory, by having a redundant kernel take over while one is being replaced in memory, and the second kernel handing off resources to the new primary kernel when it is ready, returning to the previous runlevel.
The long and the short of what I am saying is that there should be a second kernel in memory at all times ready to take over at any time, but programmed to not run until the first kernel dies or is being upgraded.
The disadvantage: it starts to consume extra memory resources, and process table entries, and will take a long time to perfect.
What do you think?
OFTC: By the community, for the community
This is not a core dump of a running application, but rather a core dump of the entire running system. If a kernel failure occurs, this patch will dump the contents of system memory to disk, allowing you to analyze system state from just before the crash occurred.
This would be very useful, for example, when debugging a device driver. It is not something the end-user, or even system administrator, is likely to use. It is for the kernel developer.
Other OSes (Sun Solaris, SGI IRIX, Novell Netware, to name a few) have had this capability, but Linux has not. Linux has traditionally dumped a summary of the kernel state to the screen, but that is (1) tedious to copy down by hand (which you have to, since the system is dead), and (2) not as complete as an entire system image is.
dragonhawk@iname.microsoft.com
I do not like Microsoft. Remove them from my email address.
WinDbg, while probably not redistributable, is a free download from MS. It can read NT kernel dumps. Try here or here. Unfortunately, they're already orienting these places to Win2k.
IRIX isn't Sun's UNIX, it's SGI's UNIX. They're unlikely to sue themselves for stealing an idea from IRIX....
BSD, as others have noted, has had it for ages; many other flavors of UNIX probably got the idea (and, in some if not all cases, the code) from BSD.
Finally, this will make it possible to implement the feature
most requested by Windows users: the Blue Screen of Death.
(P.S. Isn't Linux NOT supposed to crash?)
What jafac was saying is that Microsoft does not give you or offer any low-cost, distributable tools for making sense out of this massive pile of arcane characters.
Got 128 MB of system memory on a NT workstation? 512 MB on a server? Hope you've got Einstein and a couple years to sort through the thing by hand to find your problem!!
And UnknownSoldier has another very good point: The analysis tools are not cheap, and you can't share!
C'mon Microsoft, didn't you learn anything in kindergarten?
"...America's great minds of today, teaching America's great minds of tomorrow. Poor bastards." -- A Beautiful Min
On the NCR tower this was done by toggling a switch to get into the boot ROM, then choosing appropriate options, then the memory would be written to the dump device.
This helped us diagnose many strange crashes where the system wasn't functioning correctly but hadn't actually panicked. For example, one system had init die, which made logging in a bit hard, but the kernel was still running.
I've missed this feature on more recent PCish hardware, as they don't really have a boot ROM.
Perhaps someone would like to make a more Linux biased BIOS, which could include these sort of nice features.
...reminds me of the ping scheme used for redundant servers;
1. The backup mirrors the main server and pings it at the same time.
2. Ping lost? The backup server assumes the IP of the main server and keeps on going.
3. Administrator alerted, and primary server is fixed, placed in backup server mode. Repeat #1.
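The three steps above can be sketched as a tiny state machine. This is only an illustration in Python: the class name, the `max_missed` threshold, and the alert text are assumptions of mine (the post only says "ping lost"), and the actual IP takeover is left out.

```python
class Backup:
    """Backup server that takes over the service IP after missed pings."""

    def __init__(self, service_ip, max_missed=3):
        self.service_ip = service_ip
        self.max_missed = max_missed  # assumed threshold, not from the post
        self.missed = 0
        self.active = False
        self.alerts = []

    def on_ping_result(self, primary_answered):
        """Feed in one ping result; returns True once the backup owns the IP."""
        if self.active:
            return True                # step 2 already happened, keep serving
        if primary_answered:
            self.missed = 0            # step 1: primary is alive, keep mirroring
            return False
        self.missed += 1
        if self.missed >= self.max_missed:
            self.active = True         # step 2: assume the primary's IP
            self.alerts.append(
                f"primary down, backup took over {self.service_ip}"
            )                          # step 3: tell the administrator
        return self.active
```

Requiring several missed pings in a row, rather than one, is a design choice to avoid a takeover on a single dropped packet.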
I don't think so.
As far as I know, OS/2 has had this for years now.
While it will tell you to write down the information which it dumps to screen (stupid!), it actually also saves a copy to disk.
The only time I've had the pleasure of this experience was when I fried my mobo...
I hope SGI would be smart enough to do this "clean room." If not, SCO (not Sun, not HP, not IBM, not Compaq, not GNU, not AT&T, not Novell) could sue them. SCO owns the UNIX source code.
Let me know (of course you will) if I'm wrong on this.
--Al
my kernel crashed!
I guess it's great for development, though..
Kernel Panic: Linux Kernel Crash Dump Subsystem received signal 11. giving up.
--- Sueños del Sur - a webcomic about four young siblings
I wonder how they tested their software.. considering linux crashes so rarely. *rimshot*
--
Also there is pstack. Is there a linux debugging page? If not there needs to be one.
Sounds like a good idea... although windows could use this more, lol. But seriously, i have noticed some problems when it freezes (yes it has happened to me, but rarely) i have no idea why. [coughnetscape].
If we fail, we will lose the war.
Had to do it lol
Restating the obvious since nineteen aught five.
I talked to Stephen at the Expo in London and
it is not his intention to push this into 2.3. So
unless he (and Linus) changes his mind, it won't
be going into 2.3.
The description says that it saves the dump to a SCSI partition. What happens if you're running IDE?
;-)
I think the idea is pretty cool -- no more trying to figure out why ksymoops didn't grok what you hastily scribbled down. I suppose all the hardcore kernel hackers will cry "Sacrilege!" though.
P.S. Sun won't sue for stealing their crash dump idea, right?
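For anyone wondering what the ksymoops step boils down to: the core of it is turning a raw address from the oops text into "symbol+offset". Here is a rough Python sketch, assuming the usual System.map layout of "address type name" per line. The function names are hypothetical and this is not how ksymoops itself is implemented.

```python
import bisect


def load_system_map(text):
    """Parse 'address type name' lines into a list sorted by address."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return 'name+0xoffset' for the nearest symbol at or below address."""
    addrs = [a for a, _ in symbols]
    i = bisect.bisect_right(addrs, address) - 1
    if i < 0:
        return None  # address is below the first known symbol
    base, name = symbols[i]
    return f"{name}+{address - base:#x}"
```

Feed it the System.map that matches the crashed kernel and the EIP from the oops. With a mismatched map the lookup still returns something, just the wrong something, which is the classic way to get misled.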
Congratulations, you have murdered Slashdot by bowing to corporate pressure and introducing subscriptions…
Another comment - I know that reiserfs is being submitted for 2.3, but nobody knows whether Linus decides that it goes in at this time. It would be very nice to have for 2.4, though.
And ext3 is not as simple as you think - although it is "just" ext2 with a journal, you have to consider stuff like write ordering, for instance.
Forgive my ignorance, I'm not too educated on the "under the hood" stuff in *nix environments, but is this a new thing? I've noticed that under Solaris I've got core files after a crash. Are these the same type of thing or do they not apply to the kernel? If not, what are core files for? Do they have any use beyond cluttering up directories?
why is this a good thing?? it makes no sense. SGI knows nothing about making enterprise class operating systems, but they sure do know a lot about making operating systems crash...
their hardware sucks, their contributions to linux have been pathetic. alan cox/linus are definitely more adept at coding, OS scalability, and other issues involved with stuff of this nature.
and of course the *PROVEN* fact that linux does not crash, period, the end. it never has, it never will. solaris, irix, BSD all *WISH* they had the scalability and reliability that Linux has already achieved.
forget SGI, we dont need their lousy contributions, they are just leeches on the open source revolution.
LiNuX MaN
This is good... lets you see just what went wrong when the server went down for the first time in 2 years. Should make for finding the bad programs that do bring linux down.
Let me state right up front that I am a dedicated Linux advocate. Having that out of the way I must ask WTF you're smoking? Linux is damn cool but your post is so devoid of reality that it looks like the ravings of a fscking idiot
Having worked in a Solaris shop, you can see the value of having crash dumps to send to your vendor.
Actually, we were the vendor, and the crash dumps we got from customers let us pinpoint very quickly what the problem was. Once that was found, it was easy to fix. Without the crash dumps, it could take weeks to find the cause of a nasty bug. Especially intermittent ones.
With Linux having this feature, it'll be easier for driver authors to debug their code, and most likely boost the confidence of customers who want 99.999% uptime.
-- Ever notice that fast-burning fuse looks exactly the same as slow-burning fuse? I didn't... (Edgar Montrose)
Then you get a "double panic", a very cryptic message, and no crash dump. Very rare, but it can happen.
At least, that's how *BSD handles it. "double panic" is engineerese for "fix your broken hardware".
My blog: http://www.seebs.net/log/ --- My iPhone/iPad app: http://www.seebs.net/seebsfrac/
Has anyone said it was a new thing? Linux just lacked a kernel debugger and now it has one. Linux still lacks a journaled file system, and will eventually have one. Nobody's saying it's new, but it's still a reason to be happy.
Opus: the Swiss army knife of audio codec
We will then know where it crashed. What is more important is why it crashed! By the time it crashed, the real cause of the crash could be 10 layers away.
Any more information will help though.
What would be better, is not crashing. :) Linux crashing? Is that a Windows emulation feature?
Injured software engineer wins against Mattel!
I don't normally have problems with crashes either. Currently, however, I am working on kernel modules for solaris. Until I learned how to use adb on the kernel crash dump, debugging was impossible. Now it is relatively easy, just use adb -k unix.0 vmcore.0 and $c will show you the call stack. This works great for debugging kernel level drivers and modules. I can't wait to try this under linux!
--
Mike Mangino Consultant, Analysts International
Mike Mangino
mmangino@acm.org
I think this is a great detail that's needed for enterprise stuff. What about the Sun truss command?
Rather amusingly (and as you can see from a few dozen posts up above), NT actually does have this ability.
Hey AC, why don't you get your facts right ?
1 - the SGI hardware is amazing. They make some of the finest machines available ( Octane 2000, O2 ), and they achieve a level of parallelism that Linux still dreams about ( in an always undervalued Irix machine )
2 - their contributions to Linux have been very good and well received. KDBG is a very useful tool, and coupled with LKCD will make kernel and driver development a lot easier. GLX will be used in XFree86 4.0, they're working on the Linux for Merced port, etc.
3 - OS scalability: have you ever heard of IRIX and its support of more than 64 CPUs ?
4 - PROVEN fact that Linux never crashes is bullshit. It's just an OS like any other that can also crash, in particular during development releases. I'm doing multiprocessor research on it and today I made it crash twice. I like Linux very much, but I try to keep my eyes open.
Finally, what have you done for Linux lately ? SGI has been supporting Linux constantly during this last year, and they don't deserve to be treated this way ( remember, there are hard-working people there who contribute their code under the GPL )
Enough, you don't even deserve my time answering your stupid post. Go back to your perl scripts ( or was it VB ? ).
Whoa! This is great! I'm so happy for Microsoft. It's about time the government let them break up, instead of forcing them to remain a monopoly.
I just hope this means my phone bill will go down.
The resource kit includes tools to interpret the core dump and regurgitate the BSOD contents (which, BTW, almost always points to a video driver file). If that isn't good enough go over to www.sysinternals.com where there is a utility that saves the screen contents specifically.
There are "checked/debug" versions of all MS OSes, including 2000, that any MS developer with an MSDN subscription gets. These include the entire symbol table, etc. Additionally, the BSODs are virtually ALWAYS a third party driver running in protected mode that decided to take down the party, meaning the information is critically valuable.
Ironically your "NT user" impression leaves you looking far like your average slackjawed Linux yokel looking for some friends by joining the cult.
I've got this one computer at home that crashes EVERY time it hits runlevel 0. I think it's got something to do with apmd, but anyways, it's one of those computers that can turn themselves off through software. Normally, the last thing you see is:
Stopping all md devices.
System is halted.
Power down.
And at that point, you either have to turn it off yourself, or the software (apmd?) does it for you. Well this one box I have (a crappy HP I got for free) gets right to the words "power down" and then it dumps all sorts of crap onto the screen, including the values in the CPU's registers, and what I assume to be some crap from memory. What I'm thinking here though, is that since all the filesystems are already unmounted, LKCD wouldn't make a lick of difference for me. Am I right in assuming this?
-- My neighbors dog has a four inch clit.
Way to go, guys! Welcome to the 20th century!
Perhaps you'll even eventually make it to the 21st.
At least two journaled file systems will be in 2.4. Reiserfs and ext3 should both make it in. Possibly XFS as well (not heard any news about that one). Hey, and 64GB max memory in 2.4 as well (still 4GB max per process though).
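The 64GB and 4GB figures fall straight out of address widths. A quick check in Python, assuming 36-bit physical addressing (as with Intel's PAE; the post doesn't say which mechanism is used) and 32-bit virtual addresses per process:

```python
GIB = 2 ** 30

physical_bits = 36   # assumption: PAE-style 36-bit physical addresses on x86
virtual_bits = 32    # each process still sees a 32-bit virtual address space

max_physical_gib = 2 ** physical_bits // GIB      # total addressable RAM
max_per_process_gib = 2 ** virtual_bits // GIB    # address space per process
```

Note the per-process number is the size of the whole virtual address space; how much of it a program can actually use depends on how the kernel splits it.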
Plato seems wrong to me today
" Memory dump files are created when a STOP error occurs, and the system is set to save debug information in the 'Startup/Shutdown' tab of the 'System' Control Panel."
source: support.microsoft.com
This is a real easy one to set up. The feature's not usually used on small workgroup servers because there's usually no one around who can do anything with a 256MB binary. I was going to say a lot of nasty things about dumb NT admins, but I thought I'd be nice as I was one (and will be again if the money's right).
It's better to be uninformed than misinformed.
_damnit_
_damnit_
It's my job to freeze you. -- Logan's Run
That wouldn't really solve the problem -- obviously, someone has a cronjob that runs a script to scan for new /. stories.
Naturally, there are people and things we'd rather not deal with, but just like IRL, it's unavoidable. If that thought is too traumatic to deal with, you have two options:
Any kind of censorship (including an IP ban) is bad, bad, bad -- but what do I know? I'm just as much a part of the problem as anyone else.
--
E2 IN2 IE?
A good hacker should be able to make do with just a register dump, a stack trace and some program text surrounding the instruction pointer where things went belly up.
Hacking the kernel is supposed to be hard and tracing crashes given minimal information is a big part of the fun and attraction of ``iron man'' programming.
Then again, having a full dump doesn't necessarily make debugging that much easier. It's an incremental improvement over oops text.
Here is the real advantage: a dump is good from the point of view of users who need to report crashes to developers. I think that even a hack to get oops text (rather than a full dump) written to a partition would be better than asking the poor user to copy the oops text appearing on the frozen console down on a piece of paper! Forget it!
You need TWO PeeCees hooked together by serial
port. Then you put one computer in "debug boot
mode", and control the debugger using the other.
Feh.
On Solaris, you just grab the core and symbol
files, and use adb. On just one computer, with
no special boot modes, with the machine running
whatever.
Having this ability on linux will be very very
convenient.
yup once, accessing a floppy drive, dont know why, dont want to know why cuz it's never happened since :)
It seems like a fancier and easier to use method for dealing with kernel "oops" files. The kernel source tree has always had instructions on how to debug a kernel crash.
What else is different about this new SGI stuff?
Never seen a panic, but I did once see an Oops. I thought that was funny, the documentation said oopses were all but nonexistent. If I knew what did it (the box was a server and I didn't see the oops text until a week later) and could reproduce it, I'd have reported it.
I've personally never seen a userland program crash the Linux kernel. The closest I've come is having bugs in the X server lock up the keyboard and display, but the machine was still running fine in all other regards, and I was able to telnet in and initiate a clean reboot.
Couldn't agree more, we have a farm of O2s and they crash like hell, by Unix standard that is; almost one crash per month. We had some code that would systematically crash machines with panic error. I just hope that what SGI has to contribute to Linux is not instability!!!!
strace -p pid
Hope this helps...
--- polarbear
The best part isn't this particular feature (although it's a good thing). The important point to take away from this is that companies with a vested interest in particular markets are contributing to Linux.
SGI has a lot of expertise in building enterprise-class software and you can bet that there's more good stuff to come. Corel is doing interesting work on the user interface and will probably contribute lots of neat stuff to Debian. These are companies that would never collaborate directly on a product, but through Linux they end up contributing to each other and to the market as a whole.
We live in interesting times...
This software does not allow a debugger to operate on a dump file; instead they introduced a new program (lcrash) which allows the user to "interact" with the image. And this new program must be recompiled every time the kernel changes. Say hello to version skew! If it took Linux this long to get to this point, who knows how long before they'll actually be able to use a real debugger.
NT has had this for years, where have you been? Hiding under the Unix rock? Seriously, if more Unix people would actually try NT, they'd realize there is no need for Unix anymore :-)
As someone who has been doing Linux device driver development for about a year and gotten annoyed at the lack of kernel development tools, it's really nice to see this. Now if only Linus would make Andrea Arcangeli's Integrated Kernel Debugger a part of the standard tree, it would make my day.
--
This comment is (©) Copyright Deepak Saxena.
Deepak Saxena
"Computers are useless, they can only give you answers" - Picasso
This is a good thing, but it is part of a more general problem.
And that problem is that we accept tools for Linux development that are distinctly sub par. There is a lot that could, and should, be done.
I would say more, but I cannot possibly say it better than this rant does.
Cheers,
Ben
PS The Microsoft program works right and has a bad interface, the Linux program has a nice interface but sucks! Whodathunkit? (Read the link.)
My usual seat in the cluetrain is at http://pub4.ezboard.com/biwethey.ht
Having done just a very quick glance over the specs I may be wrong, but I believe they are doing what they have been doing on the SGI for a while. When an SGI running a newer flavor of IRIX does a system panic (SCSI, memory, whatever) it dumps a core out. Dumping this file is not for the drivespace-weak: if you have half a gig of RAM you have a half-gig core file. But the beauty of this is that it then automatically examines the core file and tries to figure out what killed it, so you don't have to go in and run the debugger yourself.
Having the machine tell you what memory page you were at when it took a dive makes life much nicer for the harried admin; of course if you want to dig through a core at a later time with your debugger you can but it gives you a good starting point, and tends to make tracking things down much quicker since you have a guess as to where the problem resides. Having your box tell you that you had a memory error in SIM 3 bringing the box down, having analysed the core file before you even have a chance to fire up your debugger, is a pretty nice thing.
Of course this is dependent upon my assumption that it works in the same kind of fashion as Irix (which it seems to).
There are kernel debuggers for the Linux kernel, Linus just doesn't include them in the standard kernel.
one is ftp://e-mind.com/pub/andrea/kernel-patches/ikd/
and another one is from SGI: http://oss.sgi.com/projects/kdb/
duh i'm sure he knows irix is sgi's he's saying sgi already had it so sun can't sue
Linux will probably be a "modern" OS when you can debug a live system (either remotely or via an in-kernel debugger). This is just one aspect where Linux shows its (lack of) maturity. *BSD is so much nicer to develop for.
>What exactly are you supposed to do with a kernel core dump under a closed source OS? Your supposed to put on aluminum foil suits and dance around.
One of the most recent times NT crashed on me (I am not running it at present) the BSOD (which, you might be interested to know, is a debugger's dump screen) contained enough information to debug what was the problem.
That isn't always the case, of course, and there isn't a real easy way to record the info on a BSOD screen (a large 'scope camera?)
Actually, in the support directory on the CDs are both the kernel symbols and i386 kd. Now there's not a lot of documentation on this on the CD, but if you buy the book Inside Windows NT you will get an introduction to the kernel debugger. (Also some of this information is in the Device Driver Kit.) If you are supporting NT, you need at least the Platform SDK and the DDK.
But a fat lot of good a kernel debugger does you on a closed-source OS.
NT had the future almost in its grasp, but let it slip away by being impossibly unreliable and horribly admin-unfriendly compared to any Unix product. (We worked with it for a year but eventually had to discard it as a worthless toy.)
But that was then. Now it's just plain obsolete. Face it.
"The question of whether machines can think is no more interesting than [] whether submarines can swim" - Dijkstra
Microsoft learned something in Kindergarten. The teacher said "You have to bring enough for everybody" so it stopped bringing anything interesting for show-and-tell.
In third grade, Microsoft learned that the teacher who said "I am holding you back because the rest of the class needs to catch up" really was serious.
In seventh grade it learned in "social studies" class that "we owe it to those less fortunate than ourselves to share what we have."
By eighth grade it realized the teachers were just not very bright people. It read in a book somewhere (off campus) the maxim "If you can't do it, teach it."
Uh....there's a kernel debugger for Linux, isn't there? Please say yes.....
I'm really surprised Linux didn't already have this. I haven't done any kernel stuff for Linux, so it never occurred to me that it didn't already exist.
There's a kernel debugger, right? Hit a special key sequence and the whole system stops, so you can look around at data structures? Is there? Someone please say yes...
IIRC, Linus has been against this whole idea for a long time, and I think I agree with him. The resource crunching is simply not worth the feature. Oops tracing is just as good. I don't think this should be in the kernel, but maybe as a separate patch maintained by SGI.
"I've personally never seen a userland program crash the Linux kernel."
Hah - The very first day I used Linux (RedHat 6.0) the Gimp rebooted my machine twice! That was some two months ago by now, but had I had that crash dump program I might have contributed in a more informative way than merely posting my ignorant rantings.
can you say "flamebait"?
-dilinger, who forgot his slashdot passwd 2 years ago
Where was everyone's brain on this? NT has done this for a long time, the dump can be found in the page file.
Finally a way to find out why X is always crashing...
--- Anonymous Coward
Console apps blow!
-- Abigail
Maybe you're an unusually strong source of gamma radiation :-)
Choice of masters is not freedom.
NT saves it in swap as long as the swap partition is on the same drive as the boot partition. Again, RTFM. The serial port option is only an option and is NOT the default or usual manner for memory dumps.
As someone else in this thread commented, the dump is of the entire contents of memory. This changes in Win2000, but I have not personally seen this.
_damnit_
It's my job to freeze you. -- Logan's Run
Look into the KDE screensaver.
the BSOD is alive and well on my FreeBSD desktop.
-bugg
Saving the memory image is where the similarity ends. In *BSD, Solaris, NT, etc. you also have a *real* debugger to operate on the crash dump with. In Linux, you have a half-ass tool that lets you inspect structures and needs to be recompiled EVERY TIME YOUR KERNEL CHANGES. It's a step forward for Linux, but it's still extremely primitive when compared to the kernel development environment on other OSes.
hm. If I was granted moderator points more often than once every 9 months, I'd moderate you up.
This is JUST what I'm looking for, well, it answers most of my complaints. Unfortunately, it does seem to be W2K only, not NT 4, which is likely to represent 99% of my install base for well past 12 months. (I seriously doubt that there will be any significant migration to W2K until this time next year. Oh, there will be a few early adopters, but among MY customers - almost nobody plans on putting it into production).
I wish I had a nickel for every time someone said "Information wants to be free".
These are my friends, See how they glisten. See this one shine, how he smiles in the light.
Im Glad to see that this is on LINUX because EVERYONNE nos that LINUX is total bugy! Now you can SEE Y! i know why my NT dont need ths STUF because it Never crashes1!!1 Thats y al hte other NIXs got THIS! because THEY suck 2!1! NT ROOLS YOU LINUX SHTS! AND BSD SUX!
http://softick.8m.com/
Possibly others would work as well. Check out http://www.suddendischarge.com/debuggers.html for just about every (free/shareware) debugger ever made.
However, Linux is wrapped around a poor non-standard TCP/IP stack, which won't change anytime soon.
Well, ext3 is fairly simple: on disk it is ext2 with a log file (itself a regular file). Reiserfs is coming along. What goes in may not be feature complete, but the word on the reiserfs mailing list is that they both will make it into 2.4, and should be in 2.3 fairly soon. Perhaps they will be flagged experimental in the early 2.4 kernels.
Plato seems wrong to me today
> What exactly are you supposed to do with a kernel core dump under a closed source OS?
Figure out what application was running when your system hung, tell your support provider, and get them to fix it.
This is great.
I think SGI is going to do more for Linux than most people expect. They are helping us move into the Enterprise so much faster than I ever thought possible. You should look at their web page and see all the code they have contributed, it is very nice. SGI may be struggling, but they have a large cash reserve, and are staking their existence on Linux.
I hope they succeed, and I will personally see that I get as many SGI servers around here as possible.
geach
Sorry to disappoint you, but not only are you not first, but the previous _six_ posts (a new record for Slashdot!) are not lame "First Post" posts.
Congratulations, you have murdered Slashdot by bowing to corporate pressure and introducing subscriptions...
http://slashdot.org/users.pl?op=userinfo&nick=meta wronka
wow..
Those SGI guys are experts on crashing the kernel. Maybe that's why they came up with the XFS file system! At least if you're going to crash every week, you don't have to fsck every time.
The NetApp is a BSD machine. NetBSD if I'm not mistaken. Heavily modified of course, but deep beneath there's *BSD in the beast.
NOTICE: Starting XFS recovery on filesystem: / (dev: 0/79)
Hey you guys have XFS working on Linux already? Sweet! Can't wait to see it on my box someday soon. Excellent work guys...
I think that this was one of the greatest features of Novell - the fact that if your server was barfing, you could go into the debugger, and neuter an offending process; or if the server was really in trouble, it would drop into the debugger, so you could at least figure out what went wrong, or dump the memory image and send it to someone who could.
And also, it's one of the things I really, really, really HATE about NT. No debugger comes with the OS, and there's no free, distributable one out there, so from a tech support standpoint, if your customer's server barfs, you kind of have to guess at what went wrong, or establish a pattern from multiple calls, or try to reproduce it in-house. Switching from supporting Netware products to NT products has been hell, and this is 90% of the reason. This kind of thing in Linux can only help "the cause". (and because my company is working on some fairly significant Linux products, and I may end up supporting them, this makes me more optimistic about the future.)
I wish I had a nickel for every time someone said "Information wants to be free".
These are my friends, See how they glisten. See this one shine, how he smiles in the light.
Hmmm, I thought Linux never crashed? What does this dump analyzer do?
this guy is probably running a script to post 'FIRST POST!!!' every time. or just plain _cOUgH_eGGhEad_cOUgH_.
--
You're a cartoon of rebel! You're all like exaggerated version of yourself! - Gerard Jones
Why don't you write them a letter? Better yet, why not drop by SGI and volunteer to ``help them out''. I'm sure a start-up like SGI could use volunteer ``experts'' like yourself. Maybe you could offer to do all their legal footwork.
My machine did this exact same thing ... it had something to do with a buggy BIOS, so if you're using an Epox motherboard you might want to see this page: http://www.epox.com/support/bios.html
Being able to view a dump of the memory at the state it was in when the crash occurred is an invaluable piece of data for any developer and/or support-type person.
It makes the PTF/patch process go so much easier. Of course, that's when the stack in the dump hasn't been corrupted by "unexpected" behaviour. Then, all bets are off.
BTW, does anyone know if they have any tools tailored to viewing these dumps and being able to quickly navigate through the stack, popping/pushing when needed? That would be nice too, but I never noticed anything from the announcement, and I don't have a Linux system handy at the moment to check the package.
> After all, ROM wasn't built in a day. :-)
Which ROM? 8)
I have the strong feeling that some ROMs of my hardware were coded in less than a single day...
You are mistaken. NetApp boxes do *N*O*T* run any flavor of BSD as their OS.
The underlying kernel is one written at NetApp; it doesn't support multiple address spaces, any notion of userland, or demand paging (heck, until recently, it didn't even change the page tables; it now uses the paging hardware, but only to make virtually contiguous physically-discontiguous pages, to make allocation of large chunks of memory a bit less painful).
A significant part of the code did come from BSD - the networking stack came from BSD (4.4-Lite, with some bits of the FreeBSD and NetBSD stacks thrown in), as did many of the commands (although those had to be chainsawed a bit to run in kernel mode in a shared address space), as did the dump and restore code (although the dump code was significantly changed to work with our WAFL file system). Various support routines also came from various BSDs, and the NFS server code is somewhat remotely derived from the BSD code (although it was also significantly changed to fit into our environment).
However, that doesn't mean NetApp boxes run anything you'd recognize as "BSD" (and, in particular, the crash-dumping code isn't BSD-derived, although the savecore command is based on the BSD command, although, again, significantly modified to run in our environment, and to extract the core dump information from the core dump areas on the disks).
(Yes, I know this first hand. I'm one of the developers there, and have been since early 1994.)
Meta
You must be joking.
:)
Hey, GUYS!
THIS IS A SYSTEM BASED ON GNU TOOLS!
YOU HAVE A DEBUGGER WHICH HAS SPECIAL HOOKS FOR DEBUGGING KERNEL CORE DUMPS!
This is *crazy*! That's like, uhm, like sort of a hack perpetrated by someone who was in a hurry and didn't know about prior art.
Which, I guess, is the allegation that Linux always faces from people. *sigh*. Oh, well, it'll get better.
My blog: http://www.seebs.net/log/ --- My iPhone/iPad app: http://www.seebs.net/seebsfrac/
send flames > /dev/null
Only 'flamers' flame!
The most recent Linux 2.3.25 kernel does not have ext3. ext3 is still way alpha. Linux 2.3 is already under feature freeze. If Linus plans to release Linux 2.4 by 2000 Q1, I doubt ext3 will be part of it.
cpeterso
People here are saying that yes, even NT has the ability to dump kernel core when it BSODs, but:
What exactly are you supposed to do with a kernel core dump under a closed source OS? Throw a printout of it into a bonfire to propitiate the Windows Demons? Send it to Microsoft and wait for their rigorous QA process to leap into action and send you a fixed kernel? I can't imagine trying to debug it yourself without being able to get a backtrace and look at the problem source code. Does Microsoft even leave a symbol table of internal function names in the NT kernel? What exactly do you do with a Kernel Debugger in Solaris if you can't see anything more than what a disassembler will tell you about the kernel being debugged?
One other point to note: there are a number of features still to add to reduce the size of the memory image and to speed up the dumping process. We're also looking to make NMIs work properly under Linux. Stay tuned.
--Matt
I was just thinking the other day that running Linux in a VM in Linux would be handy for, among other things, somewhat more secure network services (Mail, web services, etc.) Run your server in a virtual machine and who cares if it gets cracked? Just wipe it clean and reinstall (Once you get it the way you like it, you could write the image to CD and just restore from CD every time you reboot.)
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
The core files you're seeing save the segment of memory in which the program was running. They can be used in conjunction with a debugger and image with debugging information to recreate the state of the application when it crashed, enabling the programmer to glean information about which instruction caused the crash.
Dumping the kernel on a crash is not new but it is useful, in much the same way.
Under HP-UX, as far as I remember, when the kernel crashes it is dumped into the swap device starting backwards from the end of swap. One of the first actions of the boot sequence (and boy can that take a long time) is to check whether there is a kernel image written in swap. If so, it's copied out and can be sent back to the kernel team for investigation.
Of course, if your boot sequence doesn't copy out the kernel, you've got a finite time to get it out yourself before it's overwritten by the ever-advancing swap data.
-John
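The swap layout described above can be sketched as a toy model. This is only an illustration of the idea (image parked at the tail of swap, paging growing up from the front until it reaches the image), not HP-UX's actual on-disk format; the class and method names are made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SwapDevice:
    """Toy model of a swap area; all names here are hypothetical."""
    size: int                         # total bytes in the swap area
    used: int = 0                     # bytes consumed by paging, growing from offset 0
    dump_start: Optional[int] = None  # offset where a saved crash image begins

    def save_crash_image(self, image_size: int) -> None:
        """Panic path: place the image so that it ends at the last byte of swap."""
        if image_size > self.size:
            raise ValueError("swap area is smaller than the memory image")
        self.dump_start = self.size - image_size
        self.used = 0  # after the reboot, paging starts again from the front

    def page_out(self, nbytes: int) -> None:
        """Normal operation after reboot: swap usage advances toward the image."""
        self.used = min(self.size, self.used + nbytes)

    def dump_intact(self) -> bool:
        """The image is still recoverable only while paging has not reached it."""
        return self.dump_start is not None and self.used <= self.dump_start
```

With a 1000-byte swap area and a 300-byte image, the image sits at offset 700, so it survives until more than 700 bytes have been paged out. That is the "finite time" mentioned above.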
This makes very little sense. The purpose of the kernel is to be the thing which allocates resources. If it screwed up, how do you "recover"?
Do both kernels have access to the serial ports? Does only one? If it's only one, how do you guess what state the port is in when it dies and the other takes over? If it's both, how do you keep them from conflicting?
It turns out that the program which keeps the kernels from conflicting is, in fact, the kernel.
This mind is not buddha.
My blog: http://www.seebs.net/log/ --- My iPhone/iPad app: http://www.seebs.net/seebsfrac/
I think that this is excellent news. And more so for Linux than any other system.
It is crucially important that a community project like Linux have good debugging tools, both from the perspective of quality control, and to encourage others to get involved in the community.
Other systems that are open but don't actively encourage contributions, or worse yet are closed - well, these debuggers are useful in the sense that they help pinpoint a problem. But in many cases you don't have control of the source code, so there isn't much you can do except mail it to the developers. If they even have a place to mail it to.
In many Commercial UNIX systems, the proc filesystem isn't as broad as that of linux and the low-level tuning parameters are configured using the Kernel debugger. In some cases, the kernel debugger is the general program debugger with the ability to traverse /dev/kmem - and some functions to manipulate the appropriate data structures.
Debuggers aren't only used for debugging.
But as it is, what's the point? Has anyone ever actually SEEN a kernel panic?
So if you play with a x.(2y+1).z kernel while rubbing your feet on the carpet and a lightning rod attached to an ISA slot, then this is for you. If you only use a x.(2y).z kernel with z>2, then this'll probably do nothing more than occupy disk space.
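For anyone puzzled by the x.(2y+1).z notation: under the numbering convention in use at the time, an odd middle number marked a development series (2.1.x, 2.3.x) and an even one a stable series (2.0.x, 2.2.x). A minimal sketch of the rule of thumb above, assuming plain three-part numeric version strings; the function names are invented for the example.

```python
def parse_version(version: str) -> tuple:
    """Split a three-part version string like '2.3.25' into integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_development_kernel(version: str) -> bool:
    """Odd middle number means a development series under the old convention."""
    _, minor, _ = parse_version(version)
    return minor % 2 == 1


def crash_dumps_worth_it(version: str) -> bool:
    """The tongue-in-cheek rule: stable kernels with z > 2 supposedly never panic."""
    _, _, patch = parse_version(version)
    return is_development_kernel(version) or patch <= 2
```

By this rule a 2.3.25 or 2.1.132 kernel qualifies, and a 2.2.13 kernel does not.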
Christopher A. Bohn
cb
Oooh! What does this button do!?
Just about every other OS I know of (except for NT) includes this. Having a Kernel Debugger, Kernel Core Dump, and a few other tools available over the past few *YEARS* has saved me a lot of hassle. If Linux hasn't had this till now, I'm sooooooooooo sorry. That's really disappointing.
*BSD, Solaris, Dynix, and bazillions of other OS'es have had this ever since they were created.
This sounds like just about the coolest software utility that I will almost never have to use!
A Windows version would be orders of magnitude more useful!
What happens when the LKCDA crashes during a system crash? Who recovers from that??
Can your IM do this?
Forget knowing why my system crashed, I just want my BSOD. They better have someone working on bringing that to Linux...
This sounds like just about the coolest software utility that I will almost never have to use! [Grin]
A Windows version would be orders of magnitude more useful! [Bigger grin!]
This is really cool, and could save people lots of time and energy ... At one place I worked that was using SCO, we were having problems with something causing massive core dumps, so we sent the dump off to SCO and a few days later they sent us back a message describing exactly what was wrong with the system. It was something like, "On line 10697 you'll see four dollar signs, that's a problem with the network card..." It's good to see something like that for Linux now.
OK, I could see as how it might help the developers... ;-)
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
This will be tremendously useful for us device driver writers, and all other breeds of kernel-hackers. True, Linux rarely crashes under normal use, but when your code is running with the kernel and you make a mistake... OOPS!
For some examples, see my PIC page.
> There's a kernel debugger, right? Hit a special key sequence and the whole system stops, so you can look around at data structures? Is there? Someone please say yes...
I'm sorry, but this made me laugh out-loud.
I too am curious if there is some sort of kdb under Linux. This sort of thing was available for SCO when I was doing driver maintenance for it at my last job (but I barely knew how to use a kernel debugger at that time). I've been sysadmining IRIX at my current job, and I've been looking for a kernel debugger and the key sequence to get into it, but haven't found it yet. (Admittedly, I haven't been looking too hard, but I'm curious nonetheless.)
Catcha' later,
Paul.
Well, as a data point, BSD/OS (www.bsdi.com) is arguably "closed source", and gives out crash dumps.
I'm in the support group, and we find crash dumps *very* valuable. The customer doesn't necessarily need to have all the source, just a kernel with known characteristics...
My blog: http://www.seebs.net/log/ --- My iPhone/iPad app: http://www.seebs.net/seebsfrac/
Then again, I pulled the IDE cable out of the drive when it was running. I bet that didn't help the situation. Other than that, a 2.1.132 kernel mysteriously locked up on me, but that was a development kernel, and the machine had a 70+ day uptime. Also, I tried one of the latest 2.3.x kernels, and it crashed on me detecting the USB controller, but then I expected that :)
I had repeated sig 11 crashes whenever I compiled something or did something CPU-intensive like a find | filename. I took the RAM speed down in my BIOS and after that I took my case off and the problem vanished.
I'd support a block on your IP address if I thought it'd make you quit whining.
That's right, IBM/Sperry/Burroughs/GE/ICL had it, to catch real bugs and hardware errors. Of course, it was written raw to disk. But now it's written to memory, after checking for defined events (known PSW values) and taking defined actions (ignore, dump, kill x, or user code), plus IPCS to read the dump afterwards. The dump program formats plenty, and between stacks and pointer chains, you have a fair idea of whodidit. Of course you have SADUMP, so you can read the dump without the OS, or boot a mini system. All good stuff, except MVS is now 99.999% reliable. I don't think MS has a parmlib with a list of parameters about what user-definable actions to take. IBM also has SLIP traps, a kind of SoftICE (but hardware ICE). You set a trap or event, and get a dump when it occurs. For this reason, no bug survives on MVS.
The serial hookup trick sounds like debugging an Amiga circa 1990.
Uh, yes at least 4 times in the past 6 months.
"Oh, you BSODed? Did you install Service Pack 5?"
"You did. Well, what applications were you running?"
"Just a Q3Test server? Okay, send me the core dump and I'll check it out." (yeah right!)
Few minutes go by
"Okay, we've thoroughly analysed the data. It seems that NT crashed."
"Oh, you know it crashed? Yeah, right! You told me it crashed. Forgot about that. Hmmmmm... Have you tried rebooting? That usually works."
"It works now? Great. That'll be $3500 payable to Microsoft Inc."
"You want to know why NT crashed? I'm sorry, we cannot do that. That is proprietary information. We seem to have solved your problem, however. Your computer seems to be working fine. We expect payment within two weeks. Have a nice day."
"Evil will always triumph over good, because good is dumb." - Dark Helmet (Spaceballs)
This sounds *EXACTLY* like the way BSD kernels have, since the dawn of time, handled panics. If you have enough swap space, the kernel dumps a complete core image (in a special format) to the swap device. Then, on boot, it extracts it before enabling swap, and copies a kernel over. (Goes in /var/crash, if such a place exists.)
I've used this to debug (or have someone else debug) kernel panics on BSD/OS and NetBSD systems. It's a *very* nice feature, because, in the real world, you often have a crash that can't be encouraged to happen right when the engineer is handy.
Common feature, been available for years. I just *assumed* Linux had it.
My blog: http://www.seebs.net/log/ --- My iPhone/iPad app: http://www.seebs.net/seebsfrac/
Wow - Karma = -30?? Not exactly adding intelligent insight to stories... On a related note, would Rob or whoever be willing to block his IP address from commenting or something? I mean, with 22 comments and nothing but "First Post"...would people be willing to support this? Just curious...
Windows NT 4.0 does a very similar thing when the appropriate options are checked in the Startup and Recovery options under the System object in Control Panel. Problem is...it's the ENTIRE memory space. If you have 128 MB of RAM, you better have 128 MB of swapfile space on the system partition. Not a smart thing to do when the boot partition (where the \WINNT directory is) resides in the same partition...the hard disk has to constantly shuffle between the swap file and the \WINNT directory. If you place the swap file on a different partition (i.e. optimize for I/O speed), the crash dump file (memory.dmp) is not created when NT bluescreens. This particular thing that SGI's doing is a MUCH smarter way of going about it. Though one of the coolest things about Win2K is the fact that you can choose between a full mem dump, a kernel mem dump, and a 64K minidump. That's a Good Thing for those of us who like to optimize our swap file and move it to a different partition or split it up a bit.
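The conditions above boil down to a couple of checks. The sketch below encodes only what this comment states (swap file on the same partition as \WINNT, and at least as large as RAM); it is not taken from Microsoft documentation, and the function and parameter names are invented for illustration.

```python
from typing import Optional


def nt4_dump_will_be_written(ram_mb: int, swapfile_mb: int,
                             swapfile_on_winnt_partition: bool) -> bool:
    """NT 4.0, as described above: memory.dmp appears only if the swap file
    shares a partition with \\WINNT and can hold all of RAM."""
    return swapfile_on_winnt_partition and swapfile_mb >= ram_mb


def win2k_dump_size_kb(kind: str, ram_mb: int) -> Optional[int]:
    """Win2K dump choices named above; returns a size in KB where the comment gives one."""
    if kind == "full":
        return ram_mb * 1024  # the entire memory space
    if kind == "minidump":
        return 64             # the 64K minidump
    if kind == "kernel":
        return None           # size of a kernel-only dump is not stated here
    raise ValueError(f"unknown dump kind: {kind}")
```

So a 128 MB box with a 128 MB swap file next to \WINNT gets its dump, and the same box with the swap file moved to another partition gets nothing.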
However...sifting through that crap with the dumpchk.exe and dumpexam.exe utilities is akin to getting your teeth pulled...:)
Another nifty thing NT has is the ability to t-shoot a box by hooking up another NT box to it thru the serial port (or remotely, with a modem) and, by using the symbol files, find out EXACTLY where in the OS code a particular process is failing, because when NT bluescreens, it's not really crashed...the kernel is still spinning happily away churning out that dump file. That ain't too bad, but it's a bitch to set up.
I prefer just to decipher the bluescreen and find out which piece of shit hardware (or driver) is causing the failure...:)
-Kevin Bunn, MCSE/MCT - MCP ID # 1198191
PS: Yes, you heard it correctly. The way NT does it, the BOOT partition is where the system files (i.e. \WINNT) are and the SYSTEM partition is where the boot files (i.e. boot.ini, ntldr, and ntdetect.com) are. Another weird MS-ism...:P
My posts don't reflect the opinion of my employer, and my employer's opinion doesn't influence the content of my posts.