Slashdot Mirror


SGI announces Linux Kernel Crash Dumps (LKCD)

Alphix writes "SGI has announced their Linux Kernel Crash Dumps project - and it's gone to release. It's intended to simplify the examination of system crashes in three steps: saving the kernel memory image when the system dies due to a software failure, recovering that image when the system is rebooted, and then examining it to determine what happened when the failure occurred."

206 comments

  1. Re:Netware by UnknownSoldier · · Score: 1

    > And also, it's one of the things I really, really, really HATE about NT. No debugger comes with the OS, and there's no free, distributable one out there, so from a tech support standpoint, if your customer's server barfs, you kind of have to guess at what went wrong, or establish a pattern from multiple calls, or try to reproduce it in-house.

    Yeah, it sucks. The only solution that I'm aware of is to get your customer(s) to install MS Dev Studio, or even NuMega's SoftICE. Not very practical, but it's better than nothing.

    Cheers

  2. Re:Is this a new thing or just new to SGI? by seebs · · Score: 1

    I think you misunderstood him. Solaris (I assume) has the ability to dump core *for the kernel*. Obviously, not into the filesystem - thus the swap/savecore dance.

    And no, it's not only for applications. And it's *very* useful.

    --
    My blog: http://www.seebs.net/log/ --- My iPhone/iPad app: http://www.seebs.net/seebsfrac/
  3. Re:Do we really need this? by LocalYokel · · Score: 1

    Crucify me for saying this, but I have identical results with NT as both a workstation and a server... Your mileage may vary.

    --
    E2 IN2 IE?

  4. Re:Not JUST a core dumper by OldTechnoFreak · · Score: 1

    As you note, this is far more than a mere kernel core dumper. I know this site attracts many professional developers and sysadmins, but there are far more who have never had the pleasure of driving IRIX. Linux is really good, given its maturity level, and I use it at work and at home - I develop for it at work for our products (see Ariel Corporation ISP products) - but IRIX has some real jewels, and SGI has chosen to give the technology to the open source world. The first was XFS (imo the best filesystem ever invented), plus some other assorted stuff that SGI is paying its programming staff to give to us, and now the technology to pinpoint the exact cause of a kernel crash. The IRIX kernel crash postmortem technology is far beyond a mere core dump and pointer - it tries its best to identify the offending system call, and the pid if it can. This release appears to be a port of that technology to Linux.
    Stop whining, folks - we have just been given one of the best debugging tools in existence (especially for kernel hackers and device driver writers!!) as a gift. Try using it, and be sure to thank SGI. After all, even though they have market reasons to do this, they still *did* it.

  5. So, the value is... by dmercer · · Score: 3
    From some of the posts above, I gather that there is some confusion about the significance of the functionality being provided by SGI with LKCD.

    Yes, every reasonable operating system can be configured to save the core image resulting from a kernel panic to swap, and yes, many provide excellent tools for conducting a post-mortem analysis of the image to diagnose what caused it to croak. But in the past, with the notable exception of IRIX, this process required a fairly intimate knowledge of the operating system and even the underlying hardware, and was considered something of a black art. An excellent book on core dump analysis issues and procedures is 'PANIC! UNIX System Crash Dump Analysis', published by SunSoft. IRIX, and now Linux when properly configured, automatically conduct the crash dump analysis upon re-entering multi-user mode, saving a legible and comprehensible report detailing what was going on at the time of the crash and providing a suggestion as to the cause.

    This facility can be an excellent way of quickly tracking down the cause of the panic, or at least determining whether the problem lies in hardware or software. Below are three examples of some recent reports generated at our site:

    Sample 1

    Sample 2

    Sample 3

    While this utility is no replacement for an experienced sysadmin and a debugger when it comes to deciphering the cause of failure in complex systems (especially SMP), it will likely be a boon to the hundreds of thousands of Linux admins supporting small workgroup servers and workstations. And yes, Linux is stable.. but c'mon: kernels panic.
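
    To make "providing a suggestion as to the cause" a bit more concrete, here is a minimal Python sketch of one heuristic such a report could use: walk the stack backtrace from the innermost frame outward and name the first function that is not part of the panic-reporting path. This is an illustration only, not how LKCD or IRIX actually do it. The names PANIC_PLUMBING, frames and first_suspect are hypothetical, and the list of "plumbing" functions and the sample lines are taken from the Solaris backtrace quoted further down this thread.

```python
import re

# Frames that only report a panic rather than cause it. This list is an
# assumption based on the Solaris backtrace quoted in this thread; other
# kernels use other names.
PANIC_PLUMBING = {"panic", "complete_panic", "do_panic", "vcmn_err", "cmn_err"}

# Matches lines such as "lm_get_sysid(0x60141f58,...) + 16c"
# or "lm_nlm_reclaim(?) + e0" and captures the function name.
FRAME_RE = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\(")


def frames(backtrace_text):
    """Return the function names in a backtrace, innermost frame first."""
    names = []
    for line in backtrace_text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            names.append(match.group(1))
    return names


def first_suspect(backtrace_text):
    """Return the first frame that is not panic plumbing, or None."""
    for name in frames(backtrace_text):
        if name not in PANIC_PLUMBING:
            return name
    return None


# The first six frames of the backtrace quoted later in this thread.
SAMPLE = """\
complete_panic(0x9,0x301a987c,0x301a96a0,0x0,0x0,0x1) + 30
do_panic(0x6058c078,0x301a987c,0x301a9ec0,0x600c6b60,0x1043cf00,0x0) + 9c
vcmn_err(0x3,0x6058c078,0x301a987c,0x69,0x0,0x3) + 150
cmn_err(0x3,0x6058c078,0x301a9ec0,0x6084d0b8,0x2c002a,0x1) + 1c
lm_get_sysid(0x60141f58,0x6011df9c,0x6084d0d8,0x3ffc,0x0,0x0) + 16c
lm_nlm_reclaim(?) + e0
"""

if __name__ == "__main__":
    print(first_suspect(SAMPLE))  # prints "lm_get_sysid"
```

    For that trace the sketch names lm_get_sysid, which agrees with the panic string in the same report. A real analyzer would need far more than this (the panic string, per-CPU state, the message buffer), so treat it as a starting point for reading a trace, not a diagnosis.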

    1. Re:So, the value is... by seebs · · Score: 1

      BSD/OS has had a similar feature. *completely* automatic? No.

      cd /var/crash
      /sys/scripts/kanal 0

      And you get a file called 'info.0' which is a nice summary of everything important to ftp up to BSDI's support group.

      --
      My blog: http://www.seebs.net/log/ --- My iPhone/iPad app: http://www.seebs.net/seebsfrac/
    2. Re:So, the value is... by Anonymous Coward · · Score: 1

      This IRIX stuff is baloney. Sun has ISCDA which does that and more. But since you know about the PANIC book, you probably know about this simple script. Here is sample output:


      ******************************************************************************
      Initial System Crash Dump Analysis Output iscda Rev 1.4
      Sat Nov 6 01:02:45 EST 1999
      ******************************************************************************

      ************************************
      ** Initial information from adb **
      ************************************

      physmem 3de7
      utsname:
      utsname: sys SunOS
      utsname+0x101: node dogmeat.dog.meat.com
      utsname+0x202: release 5.5.1
      utsname+0x303: version Generic_103640-24
      utsname+0x404: machine sun4u
      srpc_domain:
      srpc_domain: dog.meat.com Domain name
      1999 Oct 31 01:54:41 Time of boot
      time:
      time: 1999 Nov 5 22:50:36 Time of crash

      Auditing is not enabled

      Quotas are not enabled


      ** Panic String **
      --------------------
      lm_blocks+0x37c: lm_get_sysid: cached entry not found


      ** Stack Backtrace **
      -----------------------
      complete_panic(0x9,0x301a987c,0x301a96a0,0x0,0x0,0x1) + 30
      do_panic(0x6058c078,0x301a987c,0x301a9ec0,0x600c6b60,0x1043cf00,0x0) + 9c
      vcmn_err(0x3,0x6058c078,0x301a987c,0x69,0x0,0x3) + 150
      cmn_err(0x3,0x6058c078,0x301a9ec0,0x6084d0b8,0x2c002a,0x1) + 1c
      lm_get_sysid(0x60141f58,0x6011df9c,0x6084d0d8,0x3ffc,0x0,0x0) + 16c
      lm_nlm_reclaim(?) + e0
      lm_reclaim_lock(0x6044095c,0x6011df90,0x1040ae01,0x20,0x60440950,0x703)
      lm_relock_server(0x608f2910,0x6084df80,0x6058cdc0,0x6058cdb4,0x60dea900,0x60dea900) + 1b0
      lm_recovery(0x301a9ad0,0x608592c8,0x608592bc,0x1,0x301a9ad0,0x0) + f0
      lm_nlm_dispatch(0x12,0x301a9c5c,0x0,0x60858378,0x0,0x0) + 3ec
      svc_getreq(0x301a9c5c,0x60cc25a0,0x1,0x4,0x3,0x6008e9b0) + 164
      svc_run(0x60cc25a0,0x6005be84,0x6005be7c,0x6005be90,0x6005be8c,0x6005be44) + 3dc


      ** Per CPU information **
      ---------------------------
      ncpus:
      ncpus: 1 # of CPUs present
      ncpus_online:
      ncpus_online: 1 # of CPUs online


      cpu0+8: 1b0000 Thread address

      data address not found


      ** Stacktrace **
      -----------------
      l0 l1 l2 l3
      l4 l5 l6 l7
      i0 i1 i2 i3
      i4 i5 i6 i7

      0x301a96a0: 0 104146a8 0 0
      1 10406000 0 0
      9 301a987c 301a96a0 0
      0 1 301a9708 1001d3c8

      0x301a96a0: 0 cpu0 0 0
      1 vmhatstat+0x4d0 0 0
      SLOAD_DEBUG+1 0x301a987c 0x301a96a0 0
      0 1 0x301a9708 do_panic+0x9c


      0x301a9708: 104146a8 0 0 6001ddc0
      610aa3c8 2 1 1
      6058c078 301a987c 301a9ec0 600c6b60
      1043cf00 0 301a9768 100597f0

      0x301a9708: cpu0 0 0 0x6001ddc0
      0x610aa3c8 2 1 1
      lm_blocks+0x37c 0x301a987c 0x301a9ec0 spec_lostpage+0xc04
      strreflock 0 0x301a9768 vcmn_err+0x150


      0x301a9768: 9968c8d0 0 0 6c
      6c 6c 60581c40 2
      3 6058c078 301a987c 69
      0 3 301a97d0 10059690

      0x301a9768: 0x9968c8d0 0 0 PGSHIFT_DEBUG+0x5f
      PGSHIFT_DEBUG+0x5f PGSHIFT_DEBUG+0x5f
      acl_timer_type_v3+0xba1 2 3
      lm_blocks+0x37c 0x301a987c PGSHIFT_DEBUG+0x5c
      0 3 0x301a97d0 cmn_err+0x1c


      0x301a97d0: 10409d88 0 6068bbc0 0
      0 10409eb4 10409eb4 0
      3 6058c078 301a9ec0 6084d0b8
      2c002a 1 301a9830 60586d80

      0x301a97d0: mutex_ops 0 0x6068bbc0 0
      0 rwlock_ops rwlock_ops 0
      3 lm_blocks+0x37c 0x301a9ec0 0x6084d0b8
      0x2c002a 1 0x301a9830 lm_get_sysid+0x16c


      0x301a9830: 1 0 1 0
      3fff 0 60599e68 6084d080
      60141f58 6011df9c 6084d0d8 3ffc
      0 0 301a98a0 6085620c

      0x301a9830: 1 0 1 0
      PGSHIFT_DEBUG+0x3ff2 0 lm_sysids_lock
      0x6084d080 rootnex_ops+0x2438 0x6011df9c
      0x6084d0d8 PGSHIFT_DEBUG+0x3fef 0
      0 0x301a98a0 lm_nlm_reclaim+0xe0


      0x301a98a0: 1 606a4160 704 60b1b434
      4 301a9910 2 1
      6044095c 6011df90 1040ae01 20
      60440950 703 301a9958 6058b2c0

      0x301a98a0: 1 ltable+0x2d4 PGSHIFT_DEBUG+0x6f7
      0x60b1b434 PR_SIZE 0x301a9910 2
      1 0x6044095c 0x6011df90 utsname+0x101
      PGSHIFT_DEBUG+0x13 0x60440950 PGSHIFT_DEBUG+0x6f6
      0x301a9958 lm_relock_server+0x1b0

      0x301a9958: 6058cc00 6058cc00 6058cd94 6058cd88
      6058cd68 6058cd5c 1 4000
      608f2910 6084df80 6058cdc0 6058cdb4
      60dea900 60dea900 301a99c8 608558f8

      0x301a9958: lm_blocks+0xf04 lm_blocks+0xf04 lm_blocks+0x1098
      lm_blocks+0x108c lm_blocks+0x106c
      lm_blocks+0x1060 1 PGSHIFT_DEBUG+0x3ff3
      ism_off+0x39c 0x6084df80 lm_blocks+0x10c4
      lm_blocks+0x10b8 0x60dea900 0x60dea900
      0x301a99c8 lm_recovery+0xf0

      0x301a99c8: 3c0000 0 60599c00 1
      0 0 186b5 2
      301a9ad0 608592c8 608592bc 1
      301a9ad0 0 301a9a38 60855f2c

      0x301a99c8: 0x3c0000 0 0x60599c00 1
      0 0 0x186b5 2
      0x301a9ad0 block_lock_msg_disp+0xee0 block_lock_msg_disp+0xed4
      1 0x301a9ad0 0 0x301a9a38
      lm_nlm_dispatch+0x3ec

      0x301a9a38: 2d 60130c88 301a9ec0 0
      18350163 1 12 1
      12 301a9c5c 0 60858378
      0 0 301a9b08 6012adac

      0x301a9a38: PGSHIFT_DEBUG+0x20 svc_clts_op 0x301a9ec0
      0 0x18350163 1 PGSHIFT_DEBUG+5
      1 PGSHIFT_DEBUG+5 0x301a9c5c 0
      lm_nlm_disp+0x120 0 0
      0x301a9b08 svc_getreq+0x164

      0x301a9b08: 6013dda0 60169550 301a9cdc 1003d140
      ffffffff 6013ddd4 301a9c60 0
      301a9c5c 60cc25a0 1 4
      3 6008e9b0 301a9bb8 6012b274

      0x301a9b08: rqcred_lock ledmadelay+0x150 0x301a9cdc
      nfs_svc+0x140 VADDR_MASK_DEBUG svc_lock
      0x301a9c60 0 0x301a9c5c 0x60cc25a0
      1 PR_SIZE 3 scsi_log_mutex+0x4d24
      0x301a9bb8 svc_run+0x3dc

      0x301a9bb8: 6005be58 6005be74 6005be80 6005be4c
      6005be38 6005be3c 0 6005be78
      60cc25a0 6005be84 6005be7c 6005be90
      6005be8c 6005be44 301a9d20 10025470

      0x301a9bb8: pteminfo+0x2030 pteminfo+0x204c pteminfo+0x2058 pteminfo+0x2024
      pteminfo+0x2010 pteminfo+0x2014 0 pteminfo+0x2050
      0x60cc25a0 pteminfo+0x205c pteminfo+0x2054 pteminfo+0x2068
      pteminfo+0x2064 pteminfo+0x201c 0x301a9d20 thread_start+4


      0x301a9d20: 0 0 0 0
      0 0 0 0
      6005be30 0 0 0
      0 0 0 6012ae98

      0x301a9d20: 0 0 0 0
      0 0 0 0
      pteminfo+0x2008 0 0 0
      0 0 0 svc_run




      ** CPU structures **
      --------------------
      cpu0:
      cpu0: id seqid flags
      0 0 1b
      cpu0+0xc: thread idle_t pause
      301a9ec0 3002bec0 3016dec0
      cpu0+0x18: lwp callo fpowner
      0 0 0
      cpu0+0x24: next prev next on prev on
      104146a8 104146a8 104146a8 104146a8
      cpu0+0x34: lock npri queue limit actmap
      0 170 60643000 606437f8 60089218
      cpu0+0x44: maxrunpri max unb pri nrunnable
      100 100 1
      cpu0+0x50: runrun kprnrn dispthread thread lock
      1 1 301a9ec0 0
      cpu0+0x5c: intr_stack on_intr intr_thread intr_actv
      30027fa0 0 3001fec0 0
      cpu0+0x6c: base_spl
      0


      ** Msgbuf **
      ------------
      msgbuf:
      msgbuf: magic size bufx bufr
      8724786 1fe8 164b 0
      msgbuf+0x10: !Aô,0/espdma@e,8400000/esp@e,8800000/sd@1,0

      sd2 at esp0: target 2 lun 0
      sd2 is /sbus@1f,0/espdma@e,8400000/esp@e,8800000/sd@2,0

      root on /sbus@1f,0/espdma@e,8400000/esp@e,8800000/sd@0,0:a fstyp
      e ufs
      zs0 at sbus0: SBus0 slot 0xf offset 0x1100000 Onboard device spa
      rc9 ipl 12
      zs0 is /sbus@1f,0/zs@f,1100000
      zs1 at sbus0: SBus0 slot 0xf offset 0x1000000 Onboard device spa
      rc9 ipl 12
      zs1 is /sbus@1f,0/zs@f,1000000
      keyboard is major minor
      mouse is major minor
      stdin is major minor
      stdout is major minor
      boot cpu (0) initialization complete - online
      ledma0 at sbus0: SBus0 slot 0xe offset 0x8400010
      le0 at ledma0: SBus0 slot 0xe offset 0x8c00000 Onboard device sp
      arc9 ipl 6
      le0 is /sbus@1f,0/ledma@e,8400010/le@e,8c00000
      lebuffer0 at sbus0: SBus0 slot 0x1 offset 0x40000
      le1 at lebuffer0: SBus0 slot 0x1 offset 0x60000 SBus level 4 spa
      rc9 ipl 7
      le1 is /sbus@1f,0/lebuffer@1,40000/le@1,60000
      dump on /dev/dsk/c0t0d0s1 size 1024448K
      syncing file systems...cpu0: SUNW,UltraSPARC (upaid 0 impl 0x10
      ver 0x22 clock 143 MHz)
      SunOS Release 5.5.1 Version Generic_103640-24 [UNIX(R) System V
      Release 4.0]
      Copyright (c) 1983-1996, Sun Microsystems, Inc.
      mem = 131072K (0x8000000)
      avail mem = 127811584
      Ethernet address = 8:0:20:79:c1:1
      root nexus = Sun Ultra 1 SBus (UltraSPARC 143MHz)
      sbus0 at root: UPA 0x1f 0x0 ...
      espdma0 at sbus0: SBus0 slot 0xe offset 0x8400000
      dma1 at sbus0: SBus0 slot 0x1 offset 0x81000
      esp0 at espdma0: SBus0 slot 0xe offset 0x8800000 Onboard device
      sparc9 ipl 4
      esp1 at dma1: SBus0 slot 0x1 offset 0x80000 SBus level 3 sparc9
      ipl 5
      sd0 at esp0: target 0 lun 0
      sd0 is /sbus@1f,0/espdma@e,8400000/esp@e,8800000/sd@0,0

      sd1 at esp0: target 1 lun 0
      sd1 is /sbus@1f,0/espdma@e,8400000/esp@e,8800000/sd@1,0

      sd2 at esp0: target 2 lun 0
      sd2 is /sbus@1f,0/espdma@e,8400000/esp@e,8800000/sd@2,0

      root on /sbus@1f,0/espdma@e,8400000/esp@e,8800000/sd@0,0:a fstyp
      e ufs
      zs0 at sbus0: SBus0 slot 0xf offset 0x1100000 Onboard device spa
      rc9 ipl 12
      zs0 is /sbus@1f,0/zs@f,1100000
      zs1 at sbus0: SBus0 slot 0xf offset 0x1000000 Onboard device spa
      rc9 ipl 12
      zs1 is /sbus@1f,0/zs@f,1000000
      keyboard is major minor
      mouse is major minor
      stdin is major minor
      stdout is major minor
      boot cpu (0) initialization complete - online
      ledma0 at sbus0: SBus0 slot 0xe offset 0x8400010
      le0 at ledma0: SBus0 slot 0xe offset 0x8c00000 Onboard device sp
      arc9 ipl 6
      le0 is /sbus@1f,0/ledma@e,8400010/le@e,8c00000
      lebuffer0 at sbus0: SBus0 slot 0x1 offset 0x40000
      le1 at lebuffer0: SBus0 slot 0x1 offset 0x60000 SBus level 4 spa
      rc9 ipl 7
      le1 is /sbus@1f,0/lebuffer@1,40000/le@1,60000
      dump on /dev/dsk/c0t0d0s1 size 1024448K
      panic[cpu0]/thread=0x3002bec0: zero
      syncing file systems... done
      2540 static and sysmap kernel pages
      39 dynamic kernel data pages
      183 kernel-pageable pages
      1 segkmap kernel pages
      0 segvn kernel pages
      0 current user process pages
      2763 total pages (2763 chunks)

      dumping to vp 601d307c, offset 2004688
      cpu0: SUNW,UltraSPARC (upaid 0 impl 0x10 ver 0x22 clock 143 MHz)

      SunOS Release 5.5.1 Version Generic_103640-24 [UNIX(R) System V
      Release 4.0]
      Copyright (c) 1983-1996, Sun Microsystems, Inc.
      mem = 131072K (0x8000000)
      avail mem = 127811584
      Ethernet address = 8:0:20:79:c1:1
      root nexus = Sun Ultra 1 SBus (UltraSPARC 143MHz)
      sbus0 at root: UPA 0x1f 0x0 ...
      espdma0 at sbus0: SBus0 slot 0xe offset 0x8400000
      dma1 at sbus0: SBus0 slot 0x1 offset 0x81000
      esp0 at espdma0: SBus0 slot 0xe offset 0x8800000 Onboard device
      sparc9 ipl 4
      esp1 at dma1: SBus0 slot 0x1 offset 0x80000 SBus level 3 sparc9
      ipl 5
      sd0 at esp0: target 0 lun 0
      sd0 is /sbus@1f,0/espdma@e,8400000/esp@e,8800000/sd@0,0

      sd1 at esp0: target 1 lun 0
      sd1 is /sbus@1f,0/espdma@e,8400000/esp@e,8800000/sd@1,0

      sd2 at esp0: target 2 lun 0
      sd2 is /sbus@1f,0/espdma@e,8400000/esp@e,8800000/sd@2,0

      root on /sbus@1f,0/espdma@e,8400000/esp@e,8800000/sd@0,0:a fstyp
      e ufs
      zs0 at sbus0: SBus0 slot 0xf offset 0x1100000 Onboard device spa
      rc9 ipl 12
      zs0 is /sbus@1f,0/zs@f,1100000
      zs1 at sbus0: SBus0 slot 0xf offset 0x1000000 Onboard device spa
      rc9 ipl 12
      zs1 is /sbus@1f,0/zs@f,1000000
      keyboard is major minor
      mouse is major minor
      stdin is major minor
      stdout is major minor
      boot cpu (0) initialization complete - online
      ledma0 at sbus0: SBus0 slot 0xe offset 0x8400010
      le0 at ledma0: SBus0 slot 0xe offset 0x8c00000 Onboard device sp
      arc9 ipl 6
      le0 is /sbus@1f,0/ledma@e,8400010/le@e,8c00000
      lebuffer0 at sbus0: SBus0 slot 0x1 offset 0x40000
      le1 at lebuffer0: SBus0 slot 0x1 offset 0x60000 SBus level 4 spa
      rc9 ipl 7
      le1 is /sbus@1f,0/lebuffer@1,40000/le@1,60000
      dump on /dev/dsk/c0t0d0s1 size 1024448K
      panic[cpu0]/thread=0x301a9ec0: lm_get_sysid: cached entry not fo
      und
      syncing file systems... done
      3116 static and sysmap kernel pages
      42 dynamic kernel data pages
      193 kernel-pageable pages
      0 segkmap kernel pages
      0 segvn kernel pages
      0 current user process pages
      3351 total pages (3351 chunks)

      dumping to vp 601d307c, offset

      physmem 3de7



      **************************************
      ** Process information from crash **
      **************************************

      dumpfile = vmcore.10, namelist = unix.10, outfile = stdout
      > PROC TABLE SIZE = 1978
      SLOT ST PID PPID PGID SID UID PRI NAME FLAGS
      0 t 0 0 0 0 0 96 sched load sys lock
      1 s 1 0 0 0 0 58 init load
      2 s 2 0 0 0 0 98 pageout load sys lock nowait
      3 s 3 0 0 0 0 60 fsflush load sys lock nowait
      4 s 402 400 400 400 60001 58 ns-httpd load jctl
      5 s 348 1 348 348 0 58 nsrexecd load jctl
      6 s 21941 1 475 475 0 40 start_hawk_gol load
      7 s 154 1 154 154 0 48 inetd load
      8 s 131 1 131 131 0 58 rpcbind load
      9 s 139 1 139 139 0 58 ypbind load
      10 s 159 1 159 159 0 58 statd load jctl
      11 s 161 1 161 161 0 59 lockd load
      12 s 133 1 133 133 0 43 keyserv load
      13 s 126 1 126 126 0 58 in.routed load
      14 s 189 1 189 189 0 59 automountd load
      15 s 193 1 193 193 0 58 syslogd load nowait
      16 s 287 1 0 0 18473 54 frg load
      17 s 222 1 222 222 0 58 lpsched load nowait
      18 s 229 222 222 222 0 58 lpNet load nowait jctl
      19 s 236 1 236 236 0 48 sendmail.8.8.5 load jctl
      20 s 212 1 212 212 0 52 nscd load
      21 s 206 1 206 206 0 28 cron load
      22 s 246 1 246 246 0 59 utmpd load
      23 s 272 1 272 272 0 48 snmpd load
      24 s 11735 154 154 154 0 58 ovtelnetd load
      25 s 688 1 688 688 0 58 sssd load nowait
      26 s 388 386 383 383 0 60 ns-admin load jctl
      27 s 382 1 0 0 0 48 ns-admin load jctl
      28 s 6008 206 206 206 0 60 cron load
      29 s 22258 1 475 475 0 58 disk_impl load
      30 s 386 383 383 383 0 58 ns-admin load jctl
      32 s 6017 206 206 206 0 60 cron load
      33 s 383 1 383 383 0 38 ns-admin load jctl
      34 s 400 1 400 400 60001 0 ns-httpd load jctl
      35 s 404 400 400 400 60001 59 ns-httpd load jctl
      36 s 405 400 400 400 60001 59 ns-httpd load jctl
      37 s 406 400 400 400 60001 59 ns-httpd load jctl
      38 s 414 1 414 414 0 60 ns-admin load nowait
      39 s 418 1 418 418 60001 58 uxwdog load
      40 s 419 418 418 418 60001 58 ns-httpd load nowait
      43 s 842 154 154 154 0 45 cachefsd load
      44 s 8337 154 8337 8337 0 28 in.rlogind load
      45 s 22058 22057 22058 22058 0 58 hawk load jctl
      46 s 570 1 570 570 60001 58 uxwdog load
      47 s 571 570 570 570 60001 58 ns-httpd load nowait
      48 s 5950 206 206 206 0 60 cron load
      50 r 22245 1 22245 22245 0 100 rvd load
      51 s 698 1 698 698 0 54 ttymon load
      52 s 714 697 697 697 0 58 ttymon load jctl
      53 s 697 1 697 697 0 58 sac load jctl
      55 s 671 1 0 0 249 58 gscgid load jctl
      59 s 5341 1 5339 5236 18473 40 frg load
      61 s 6020 6019 206 206 10048 60 sh load
      63 s 23525 23524 23525 23524 10972 58 ksh load
      64 s 11738 154 154 154 0 58 ovtelnetd load
      65 s 23711 23709 23711 23711 0 38 login load
      67 s 11034 11033 11034 11033 10974 58 ksh load
      69 s 22057 21941 475 475 0 59 perl load
      70 s 13052 154 13052 13052 0 28 in.rlogind load
      71 s 11740 11738 11740 11740 0 44 login load
      72 s 22264 1 475 475 0 58 st_impl load
      74 s 5999 21941 475 475 0 60 ps load
      76 s 6003 6002 206 206 10048 60 sh load
      77 s 6002 206 206 206 10048 38 sh load
      78 s 6001 206 206 206 0 60 cron load
      79 t 24903 16794 24903 13054 10954 31 getSkewFromHis load
      80 s 5998 206 206 206 0 60 cron load
      82 s 13054 13052 13054 13054 0 28 login load
      83 s 6021 154 154 154 0 58 ovtelnetd load
      84 s 8339 8337 8339 8339 0 44 login load
      86 s 6023 6021 6023 6023 0 48 login load
      87 s 5952 5951 206 206 10048 60 sh load
      88 s 16063 154 154 154 0 58 rpc.cmsd load
      89 s 13055 13054 13055 13054 10954 48 ksh load
      91 s 11776 11740 11776 11740 19171 58 csh load jctl
      92 s 11742 11737 11742 11737 19171 54 csh load jctl
      94 s 11737 11735 11737 11737 0 52 login load
      95 s 6016 206 206 206 0 60 cron load
      98 s 5951 206 206 206 10048 42 sh load
      99 s 5934 5933 206 206 144 60 sh load
      101 s 6014 206 206 206 0 60 cron load
      102 s 23709 154 23709 23709 0 38 in.rlogind load
      106 s 6018 206 206 206 0 60 cron load
      107 s 6011 206 206 206 10048 43 sh load
      109 s 5933 206 206 206 144 42 sh load
      111 s 5953 189 5953 5953 0 58 umount load
      113 s 8340 8339 8340 8339 10974 58 ksh load
      116 s 23522 154 23522 23522 0 54 in.rlogind load
      118 s 23524 23522 23524 23524 0 48 login load
      120 s 11033 11031 11033 11033 0 28 login load
      121 s 11031 154 11031 11031 0 38 in.rlogind load
      123 s 25245 154 154 154 0 58 sadmind load jctl
      126 s 6012 6011 206 206 10048 60 sh load
      130 s 6007 6005 6007 6007 0 46 login load
      131 s 16794 13055 16794 13054 10954 58 tcsh load jctl
      132 s 6013 154 154 154 478 18 in.rshd load nowait
      135 s 6000 206 206 206 0 60 cron load
      136 s 6010 206 206 206 0 60 cron load
      137 s 6019 206 206 206 10048 38 sh load
      138 s 6005 154 6005 6005 0 28 in.rlogind load
      140 s 23712 23711 23712 23711 10972 48 ksh load
      >


      ******************************************************
      ** Strings output of complete message ring buffer **
      ******************************************************

      Generic_103640-24
      lm_get_sysid: cached entry not found
      `H-?
      CL @
      0Ah}
      FPKu
      $P"
      @Hd@
      @@A
      X`ar0`@
      `|G8`PJh`
      8`vYh`ak
      P`D6
      H`DPP`A
      H`D-P`
      x`PQ
      m `v[
      p`|F
      `C* `@
      p`PY
      `EPP`s
      ``|F
      x`at
      `EA8`@
      `v[0`>
      (`ETp`
      8`E[
      H`C)
      H`C>
      `vGP`
      H`f#
      `LB@`O
      `au
      `C%@`C#x`C
      `axx`
      @`a~
      `vV8`s
      `}rH`@
      @`|Q

      ***********************
      ** Some Statistics **
      ***********************
      physmem 3de7


      ** Directory Name Lookup Cache Statistics **
      ----------------------------------------------
      ncsize:
      ncsize: 2181 Directory name cache size
      ncstats:
      ncstats: 61848318 # of cache hits that we used
      ncstats+4: 5161608 # of misses
      ncstats+8: 1523219 # of enters done
      ncstats+0xc: 1777 # of enters tried when already cached
      ncstats+0x10: 1676007 # of long names tried to enter
      ncstats+0x14: 1322750 # of long name tried to look up
      ncstats+0x18: 211239 # of times LRU list was empty
      ncstats+0x1c: 114368 # of purges of cache
      27 Hit rate percentage
      (See /usr/include/sys/dnlc.h for more information)


      ** Kernel Memory Request Statistics **
      ----------------------------------------
      Small Large Outsized
      symbol not found

      data address not found

      data address not found
      pagesize:
      pagesize: 8192 Memory page size
      (See /usr/include/sys/sysinfo.h for more information)


      ** Streams Statistics **
      --------------------------
      In use Total Maximum Failures
      symbol not found
      pagesize:
      pagesize: 2000 2000 1fff 0
      Queues
      maxautovec:
      maxautovec: 1 9de3bed0 f427a04c b407bfc8
      MsgBlks
      _kobj_boot+0xc: f227a048 b8103ff4 f027a044 ba102000
      LinkBlks
      (See /usr/include/sys/strstat.h for more information)

      physmem 3de7


      ** Shared Memory Tuning Variables (if in use) **
      --------------------------------------------------
      shminfo_shmmax:
      shminfo_shmmax: 536870912 Max segment size
      shminfo_shmmin:
      shminfo_shmmin: 1 Min segment size
      shminfo_shmmni:
      shminfo_shmmni: 256 Max identifiers
      shminfo_shmseg:
      shminfo_shmseg: 100 Max attached shm segs per proc

      physmem 3de7


      ** Semaphore Tuning Variables (if in use) **
      ----------------------------------------------
      seminfo_semmap:
      seminfo_semmap: 10 Entries per map
      seminfo_semmni:
      seminfo_semmni: 10 Max identifiers
      seminfo_semmns:
      seminfo_semmns: 60 Max in system
      seminfo_semmnu:
      seminfo_semmnu: 30 Max undos
      seminfo_semmsl:
      seminfo_semmsl: 25 Max sems per id
      seminfo_semopm:
      seminfo_semopm: 10 Max ops per semop
      seminfo_semume:
      seminfo_semume: 10 Max undos per proc
      seminfo_semusz:
      seminfo_semusz: 96 Max bytes in undos
      seminfo_semvmx:
      seminfo_semvmx: 32767 Max sem value
      seminfo_semaem:
      seminfo_semaem: 16384 Max adjust on exit

      physmem 3de7


      ** Message Queue Tuning Variables (if in use) **
      --------------------------------------------------
      symbol not found


      ************************************
      ** Current patch revision status **
      ************************************
      Patch: 103640-19 Obsoletes: 103591-09, 103658-02, 103920-05, 103600-18, 103609-02 Packages: SUNWcs
      u, SUNWcsr, SUNWkvm, SUNWcar, SUNWhea
      Patch: 103630-10 Obsoletes: Packages: SUNWcsu, SUNWcsr
      Patch: 104849-04 Obsoletes: 103006-06 Packages: SUNWcsu, SUNWcsr, SUNWhea
      Patch: 103582-15 Obsoletes: Packages: SUNWcsu, SUNWcsr
      Patch: 103603-07 Obsoletes: Packages: SUNWcsu
      Patch: 103612-39 Obsoletes: 103615-04, 103654-01 Packages: SUNWcsu, SUNWcsr, SUNWarc, SUNWscpu, SU
      NWfns, SUNWnisu, SUNWsra
      Patch: 103622-10 Obsoletes: Packages: SUNWcsu, SUNWcsr, SUNWhea
      Patch: 103623-03 Obsoletes: Packages: SUNWcsu
      Patch: 103627-02 Obsoletes: 103606-02, 105069-01 Packages: SUNWcsu, SUNWcsr, SUNWarc, SUNWbtool, S
      UNWhea, SUNWtoo, SUNWxcu4
      Patch: 103663-11 Obsoletes: 103683-01 Packages: SUNWcsu, SUNWcsr, SUNWhea

      ****************************************
      ** Hardware Configuration Information **
      ****************************************
      System Configuration: Sun Microsystems sun4u
      Memory size: 128 Megabytes
      System Peripherals (PROM Nodes):

      Node 0xf0029588
      idprom: 01800800.2079c101.00000000.79c101a9.00000000.00000000.00000000.00000000
      reset-reason: 'S-POR'
      breakpoint-trap: 0000007f
      #size-cells: 00000002
      energystar-v2:
      model: 'SUNW,501-2836'
      name: 'SUNW,Ultra-1'
      clock-frequency: 044300e0
      banner-name: 'Sun Ultra 1 SBus (UltraSPARC 143MHz)'
      device_type: 'upa'

      Node 0xf002c7a4
      name: 'packages'

      Node 0xf0035cb0
      iso6429-1983-colors:
      name: 'terminal-emulator'

      Node 0xf0038a1c
      disk-write-fix:
      name: 'deblocker'

      Node 0xf00390e0
      name: 'obp-tftp'

      Node 0xf0042d10
      name: 'disk-label'

      Node 0xf0072654
      support:
      name: 'ufs-file-system'

      Node 0xf002c814
      stdout: fffe8810
      stdin: fffe8450
      mmu: fffea438
      memory: fffea638
      bootargs: 00
      bootpath: '/sbus@1f,0/espdma@e,8400000/esp@e,8800000/sd@0,0:a'
      gateway-ip: 00000000
      server-ip: 00000000
      client-ip: 00000000
      stdout-#lines: ffffffff
      name: 'chosen'

      Node 0xf002c880
      version: 'OBP 3.0.4 1995/11/26 17:47'
      model: 'SUNW,3.0'
      decode-complete:
      aligned-allocator:
      relative-addressing:
      name: 'openprom'

      Node 0xf002c910
      name: 'client-services'

      Node 0xf002c9b8
      tpe-link-test?: 'true'
      scsi-initiator-id: '7'
      keyboard-click?: 'false'
      keymap:
      ttyb-rts-dtr-off: 'false'
      ttyb-ignore-cd: 'true'
      ttya-rts-dtr-off: 'false'
      ttya-ignore-cd: 'true'
      ttyb-mode: '9600,8,n,1,-'
      ttya-mode: '9600,8,n,1,-'
      sbus-probe-list: '012'
      mfg-mode: 'off '
      diag-level: 'max'
      #power-cycles: '19'
      system-board-serial#: '5012854003306'
      system-board-date: '30c5fe19'
      fcode-debug?: 'false'
      output-device: 'screen'
      input-device: 'keyboard'
      load-base: '16384'
      boot-command: 'boot'
      auto-boot?: 'true'
      watchdog-reboot?: 'false'
      diag-file:
      diag-device: 'net'
      boot-file:
      boot-device: 'disk'
      local-mac-address?: 'false'
      ansi-terminal?: 'true'
      screen-#columns: '80'
      screen-#rows: '34'
      silent-mode?: 'false'
      use-nvramrc?: 'false'
      nvramrc:
      security-mode: 'none'
      security-password:
      security-#badlogins: '0'
      oem-logo?: 'true'
      oem-banner: '008216'
      oem-banner?: 'false'
      hardware-revision:
      last-hardware-update: '981112'
      diag-switch?: 'false'
      name: 'options'

      Node 0xf002ca28
      net-aui: '/sbus/ledma@e,8400010:aui/le@e,8c00000'
      net-tpe: '/sbus/ledma@e,8400010:tpe/le@e,8c00000'
      net: '/sbus/ledma@e,8400010/le@e,8c00000'
      disk: '/sbus/espdma@e,8400000/esp@e,8800000/sd@0,0'
      cdrom: '/sbus/espdma@e,8400000/esp@e,8800000/sd@6,0:f'
      tape: '/sbus/espdma@e,8400000/esp@e,8800000/st@4,0'
      tape1: '/sbus/espdma@e,8400000/esp@e,8800000/st@5,0'
      tape0: '/sbus/espdma@e,8400000/esp@e,8800000/st@4,0'
      disk6: '/sbus/espdma@e,8400000/esp@e,8800000/sd@6,0'
      disk5: '/sbus/espdma@e,8400000/esp@e,8800000/sd@5,0'
      disk4: '/sbus/espdma@e,8400000/esp@e,8800000/sd@4,0'
      disk3: '/sbus/espdma@e,8400000/esp@e,8800000/sd@3,0'
      disk2: '/sbus/espdma@e,8400000/esp@e,8800000/sd@2,0'
      disk1: '/sbus/espdma@e,8400000/esp@e,8800000/sd@1,0'
      disk0: '/sbus/espdma@e,8400000/esp@e,8800000/sd@0,0'
      scsi: '/sbus/espdma@e,8400000/esp@e,8800000'
      floppy: '/sbus/SUNW,fdtwo'
      ttyb: '/sbus/zs@f,1100000:b'
      ttya: '/sbus/zs@f,1100000:a'
      keyboard!: '/sbus/zs@f,1000000:forcemode'
      keyboard: '/sbus/zs@f,1000000'
      name: 'aliases'

      Node 0xf004e8e8
      reg: 00000000.00000000.00000000.04000000.00000000.10000000.00000000.02000000.00000000.20000
      000.00000000.02000000
      available: 00000000.21f3e000.00000000.00014000.00000000.21c02000.00000000.000c2000.00000000
      .20000000.00000000.01400000.00000000.10000000.00000000.02000000.00000000.00000000.00000000.04000000
      name: 'memory'

      Node 0xf004eec8
      reg: 000001fe.00000000.00000000.00008000
      slot-address-bits: 0000001c
      up-burst-sizes: 0078007f
      burst-sizes: 00f8007f
      device_type: 'sbus'
      name: 'sbus'
      model: 'SUNW,sysio'
      thermal-interrupt:
      bus-parity-generated:
      upa-portid: 0000001f
      clock-frequency: 017d7840

      Node 0xf0059d28
      internal-loopback:
      dma-model: 'apcdma'
      interrupts: 00000024
      reg: 0000000d.0c000000.00000200
      name: 'SUNW,CS4231'

      Node 0xf0059e34
      address: fffb6000
      reg: 0000000f.01900000.00000001
      name: 'auxio'

      Node 0xf0059ec4
      version: 4f425020.332e302e.34203139.39352f31.312f3236.2031373a.34370050.4f535420.322e302e.34203139.39352f30.392f3138.2030333a.353900
      model: 'SUNW,525-1410'
      reg: 0000000f.00000000.00080000.0000000f.01380000.00080000
      name: 'flashprom'

      Node 0xf0059f8c
      status: 'disabled'
      device_type: 'block'
      interrupts: 00000029
      reg: 0000000f.01400000.00000008
      name: 'SUNW,fdtwo'

      Node 0xf005a0c0
      address: fffba000
      reg: 0000000f.01200000.00002000
      model: 'mk48t59'
      name: 'eeprom'

      Node 0xf005a174
      port-b-ignore-cd:
      port-a-ignore-cd:
      address: fffd8000
      interrupts: 00000028
      device_type: 'serial'
      reg: 0000000f.01100000.00000004
      name: 'zs'

      Node 0xf005a24c
      address: fffb0000
      port-b-ignore-cd:
      port-a-ignore-cd:
      keyboard:
      interrupts: 00000028
      device_type: 'serial'
      reg: 0000000f.01000000.00000004
      name: 'zs'

      Node 0xf005a394
      address: fffb8000
      model: 'SUNW,sc-up'
      reg: 0000000f.01300000.00000008
      name: 'sc'

      Node 0xf005a448
      reg: 0000000f.01304000.00000003
      name: 'SUNW,pll'

      Node 0xf006120c
      reg: 0000000e.08400000.00000010
      name: 'espdma'

      Node 0xf00614a0
      device_type: 'scsi'
      clock-frequency: 02625a00
      interrupts: 00000020
      reg: 0000000e.08800000.00000040
      name: 'esp'

      Node 0xf0063c50
      device_type: 'block'
      name: 'sd'

      Node 0xf0064688
      device_type: 'byte'
      name: 'st'

      Node 0xf0065370
      burst-sizes: 0000003f
      reg: 0000000e.08400010.00000020
      name: 'ledma'

      Node 0xf0065908
      device_type: 'network'
      busmaster-regval: 00000007
      interrupts: 00000021
      reg: 0000000e.08c00000.00000004
      name: 'le'

      Node 0xf0068194
      reg: 0000000e.0c800000.0000001c
      interrupts: 00000022
      name: 'SUNW,bpp'

      Node 0xf006a504
      model: 'SUNW,500-2015'
      reg: 00000001.00081000.00000010
      name: 'dma'

      Node 0xf006aff4
      device_type: 'scsi'
      clock-frequency: 02625a00
      intr: 00000003.00000000
      reg: 00000001.00080000.00000040
      name: 'esp'
      chip: 'FAS236'
      interrupts: 00000003

      Node 0xf006e9f0
      device_type: 'block'
      name: 'sd'

      Node 0xf006f53c
      device_type: 'byte'
      name: 'st'

      Node 0xf00700ec
      burst-sizes: 0000003f
      model: 'SUNW,500-2015'
      reg: 00000001.00040000.00020000
      name: 'lebuffer'

      Node 0xf0070330
      device_type: 'network'
      intr: 00000004.00000000
      busmaster-regval: 00000005
      reg: 00000001.00060000.00000004
      alias: 'le'
      name: 'le'
      interrupts: 00000004

      Node 0xf006a084
      manufacturer#: 00000017
      implementation#: 00000010
      mask#: 00000022
      sparc-version: 00000009
      ecache-associativity: 00000001
      ecache-line-size: 00000040
      ecache-size: 00080000
      #dtlb-entries: 00000040
      dcache-associativity: 00000001
      dcache-line-size: 00000020
      dcache-size: 00004000
      #itlb-entries: 00000040
      icache-associativity: 00000002
      icache-line-size: 00000020
      icache-size: 00004000
      upa-portid: 00000000
      clock-frequency: 088601c0
      reg: 000001c0.00000000.00000000.00000008
      device_type: 'cpu'
      name: 'SUNW,UltraSPARC'

      *****
      Done!

    3. Re:So, the value is... by in8 · · Score: 1

      This is good news!

      I liked this on IRIX, as it told me what each CPU was doing at the time of the crash in human-readable form. While I may not have known exactly what died, it was nicer to get some idea than reading hex chars. If you had an enterprise service level contract, SGI could then analyze the dump and determine what went wrong. Naturally, they're not the only OS vendor that can do it.

      BUT - it is great news to hear that vendors with enterprise experience are improving Linux!

      Hmmm - why can't SGI make cobalt style machines? at least then we could afford some stocks...

  6. Re:Uhhhh, This isn't a new thing. by Anonymous Coward · · Score: 0

    What do you mean by "non-standard"?

  7. Re:Here's one situation where it wouldn't help.... by timecop · · Score: 1

    That guy has no clue, it's obvious.
    He runs a default RedHat install, with everything enabled and still running the same kernel that came with it, he has no SCSI devices, or a clue to even know what SCSI is, his largest partition is probably a 15GB Windows98SE partition, and he boots linux to winnuke his non-elite irc friends.

    Come on, "Stopping md devices..."? Does he actually use md features? I seriously doubt it. It's just that default RootHat install pushed it down his throat. And apmd? It's kind of pointless on an AC powered system. And it's really pointless to run RedHat on a laptop since even the "Laptop" installation still installs updated which will spin your hdd every 5 seconds and make your battery last less than it does in Win95.

    I fucking hate ignorant people that Redhat and similar idiotic distributions bring to the world.

  8. Re:This would be GREAT if the Linux kernel crashed by Anonymous Coward · · Score: 0

    Either
    a) You're running kernels with odd numbers (development kernels 2.1.x or 2.3.x)
    b) You've got bad/unsupported hardware.
    c) Your computer is getting bombarded with abnormally high levels of gamma rays...

  9. Re:This is a Good Thing (tm)! by Anonymous Coward · · Score: 0

    Yeah, I have to thank SGI for all the major contributions, but I would really like to know what their business model with linux is.. how are they going to make a profit?

    oops. I forgot. Companies don't need to show a profit anymore as long as they have a cool URL and do something with linux or the internet. Too bad sgi can't have another IPO...

  10. Re:This would be GREAT if the Linux kernel crashed by Anonymous Coward · · Score: 0

    Or you're just cool like me. :) I have an unnatural ability to crash machines. Except the Win98 box. That thing's been like a rock.

    Like I said: unnatural.

    Before I got rid of it, my Linux box crashed regularly. No weird hardware or unusual drivers. The most work I'd give it was KDE.

    Unnatural.

    Eric

  11. Re:WinDbg by Anonymous Coward · · Score: 0

    Even though they are moving those pages to Win2k, windbg is most likely backward compatible to at least NT 4. If you do try it, you'll need a correct set of symbols for the machine that generated the dump. With those and the dump, it can be viewed on any machine.

  12. Re:NT does this by Anonymous Coward · · Score: 0

    HAHAHAHA. Thanks, I needed a good laugh. You've obviously never worked in the industry. UNIX admins make significantly more than NT admins, because they are way more valuable. One good UNIX admin can run a lot of machines, which can do a lot more than the same number of NT boxes. The NT admins are too busy getting paged out of bed to reboot the server.

  13. Did you GPL the code to crash the kernel? by CraigMcPherson · · Score: 3

    You should have considered placing it under a BSD-style license. Microsoft's feelings are going to be hurt over the fact that they can't incorporate it into Windows 2000.

    1. Re:Did you GPL the code to crash the kernel? by Dwonis · · Score: 1

      Heh, yeah, and wouldn't it be sad when MS would find out that it's illegal for them to use the code, because the BSDL conflicts with the GPL!
      --------
      "I already have all the latest software."

    2. Re:Did you GPL the code to crash the kernel? by Seraph · · Score: 1

      BSDL

      Blue Screen of Death License?

  14. Wonderful! by DdJ · · Score: 1

    One of the things I remember back from my days when I was tinkering with SunOS 3.4 and Ultrix and IBM's AOS (not AIX) was that many BSDish Unixes would write what were basically kernel "core dumps" to the swap partition when they died (I may be getting details wrong -- might not have been swap file, might not have been all those OSes, etc.). Sophisticated gurus could then fix things. (Back in those days I was not enough of a guru to do this myself, but I lived and worked with people who were.)

    I think it's *wonderful* that a facility like this is coming to Linux. It makes me much more enthusiastic about taking on kernel hacking myself.

    But out of fairness I do have to ask... don't the BSDoid operating systems already have this?

    And it's a little embarrassing to point out that NT has something like this as well.

  15. Re:great idea but... by DragonHawk · · Score: 1

    What happens when the LKCDA crashes during a system crash? Who recovers from that??

    Nobody. A crash dumper is going to be a minimal, always-resident program designed to simply copy physical memory to disk. If that can't be done, the system is either fried at the hardware level, or is so far corrupted that a core dump wouldn't mean much anyway.

    --

    dragonhawk@iname.microsoft.com
    I do not like Microsoft. Remove them from my email address.
  16. Redundant Kernels by cdlu · · Score: 2

    I have been thinking about a solution to kernel panics and no-reboot kernel upgrades for a while, and here is the only thing I have come up with that seems viable:

    We have redundant power supplies, hard drives, and many other pieces of hardware. I am thinking it may be good for developers, at any rate, to use redundant kernels. Kernel 1 dies, kernel 2 realises this and kills kernel 1 and takes over the system. Interrupt in service: a few clock cycles. Perhaps a new runlevel should be implemented into the linux kernel...runlevel 7, which would be against the POSIX standard I think, not sure, but would allow a condition in which the kernel is replacing itself in memory, by having a redundant kernel take over while one is being replaced in memory, and the second kernel handing off resources to the new primary kernel when it is ready, returning to the previous runlevel.

    The long and the short of what I am saying is that there should be a second kernel in memory at all times ready to take over at any time, but programmed to not run until the first kernel dies or is being upgraded.

    The disadvantage: it starts to consume extra memory resources, and process table entries, and will take a long time to perfect.

    What do you think?

    1. Re:Redundant Kernels by jd · · Score: 2
      Not really. What you're describing could be modelled as two virtual machines, each running Linux and High Availability software. If one kernel dies, the processes migrate to the second virtual machine.

      This -could- be done with only minimal enhancements to Linux and the existing HA software - the support of two (or more) virtual machines within one (or more) physical machines.

      Actually, this would go beyond crash recovery, as you could use this to do better scaling of multi-processor/multi-machine environments. Instead of trying to map N processes onto M components, you're only mapping N processes onto N virtual machines, and then N virtual machines onto M components. Because you already know and understand virtual machines, that's a much easier problem to solve.

      --
      It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
    2. Re:Redundant Kernels by Anders · · Score: 1

      You might have redundant kernels of the same version, as long as you do not plan to use this for upgrading to newer kernels.

      Data structures regularly change even in the stable kernels and providing an upgrade path for this would clutter the kernel to an extent that I do not believe Linus would accept.

      --

    3. Re:Redundant Kernels by Anonymous Coward · · Score: 0

      This is bogus, you need a fault-tolerant system, either a 3-way voting system or a 2-way transaction comparison. All the mega joints out there that absolutely cannot have 1 second of outage are running FT boxes.
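
      A minimal Python sketch of the two approaches named above, assuming each replica's result can simply be compared for equality. The function names `majority_vote` and `compare_pair` are made up for illustration; real FT boxes do this in hardware, not in application code.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value a strict majority of replicas agree on, or raise."""
    if not outputs:
        raise ValueError("no replica outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: replicas disagree")
    return value


def compare_pair(a, b):
    """Two-way comparison: detects a mismatch but cannot say which side is wrong."""
    if a != b:
        raise RuntimeError("mismatch: stop and fail over")
    return a


# Three replicas, one of them faulty: the voter masks the bad result.
result = majority_vote([42, 42, 7])
print(result)
```

      With three voters a single bad replica is outvoted and the system keeps running; with a two-way comparison a disagreement can only be detected, so the pair has to stop.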

    4. Re:Redundant Kernels by mangino · · Score: 1

      How do you keep the state of both kernels the same? If you can't keep them in exactly the same state, you end up with a worse problem than if the machine just crashed. If you keep both in the same state, then they should both crash at the same time.
      In the UNIX world, machines oops and panic for a reason: the machine is in an unstable state, and continuing to execute could allow data corruption. This is a Good Thing (tm).
      If you need redundancy on this level, look at clustering technologies with process migration and n+1 sparing.
      --
      Mike Mangino Consultant, Analysts International

      --
      Mike Mangino
      mmangino@acm.org
    5. Re:Redundant Kernels by Salamander · · Score: 2

      >We have redundant power supplies, hard drives, and many other pieces of hardware. I am thinking it may be good for developers, at any rate, to use redundant kernels.
      >...
      >Interrupt in service: a few clock cycles

      Sorry, but this doesn't work. You'd have to replicate the entire kernel address space including that used by drivers and on behalf of user programs for it to be effective. Many crashes actually result from corruption of some part of kernel memory, so if the two kernels share data they'd both crash at once. In addition, because the in-kernel data structures kept on their behalf might be different (if the two kernels were precisely identical they'd also crash in tandem) the user programs would have to be duplicated too. Now you've halved your memory and CPU resources, plus you're effectively doing a context switch "every few instructions" to go from one kernel to the other. Your performance is going to be totally abysmal.

      The solution? Do what fault-tolerant systems already do and replicate the hardware as well. Been there, done that, works OK but gives lousy performance/dollar compared to non-FT alternatives. If you don't want to pay that premium you can go with a clustered highly available system such as the ones I once helped develop. Unlike an FT system, an HA system will allow an interruption when a component fails, but the duration of that interruption will be very short relative to a "vanilla" non-FT non-HA system. Also, in the absence of failures the component nodes of an HA cluster (a well-designed one, anyway) are able to process their own independent workloads instead of sitting around idle or duplicating each other's work.

      --
      Slashdot - News for Herds. Stuff that Splatters.
    6. Re:Redundant Kernels by GoRK · · Score: 1

      What you're talking about here is probably impossible with the current kernel/process/user design of almost all modern computing.

      The easiest way to think about this problem is to think about what causes the kernel to die. So, what can happen?

      If it's a hardware fault, e.g. failed memory, disk, processor, then the only real way to fix it is with hardware redundancy. Yes, I know there are software workarounds, including a patch for the Linux kernel that allows it to detect and avoid bad memory segments; but if these are included in a running kernel, a redundant sort of kernel isn't going to do any good anyway.

      If it's a software fault and the kernel dies then how do you suppose the "other kernel" can pick up right where the first left off? It can't pick right up or it's going to crash for the same reason the first one did. You can't use a different kernel version as your redundant kernel because it has to work exactly the same way as the kernel that just crashed. So at the very least you'd have to drop all of your processes and go through kernel bootstrap again. In order for this to work you'd have to basically keep kernel 2 a certain amount of time behind kernel 1. Take the model that the EROS operating system uses. Every so many minutes the system state is saved in such a way that you can go back and restore the state of the computer at any time. This is the only way to really pull off software redundancy. It's also highly doubtful that it could ever be implemented in today's mainstream operating systems.
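
      The save-and-restore idea can be modelled in a few lines of Python. This is only a toy sketch of periodic checkpointing with rollback, not how EROS actually implements it; the class name `CheckpointedState` and the dictionary standing in for "system state" are assumptions made for the example.

```python
import copy


class CheckpointedState:
    """Toy model: keep one snapshot and roll back to it after a fault."""

    def __init__(self, state):
        self.state = state
        self.snapshot = copy.deepcopy(state)

    def checkpoint(self):
        # Save a consistent copy of the whole state.
        self.snapshot = copy.deepcopy(self.state)

    def rollback(self):
        # Discard everything that happened since the last checkpoint.
        self.state = copy.deepcopy(self.snapshot)


system = CheckpointedState({"processes": ["init"], "ticks": 0})
system.state["processes"].append("httpd")
system.state["ticks"] = 100
system.checkpoint()

system.state["ticks"] = 150        # work done after the checkpoint
system.state["processes"] = None   # simulated corruption
system.rollback()                  # back to the state saved at tick 100
print(system.state)
```

      Note what the rollback costs: the work done between tick 100 and tick 150 is lost, which is the "certain amount of time behind" trade-off described above.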

      So what's the best way to do things today?
      Hardware parallelism and software-managed high-availability are the way to go for servers. Hardware and Software watchdog timers can already reboot unresponsive computers.

      There is something close to what you are talking about, though. Redundant systems are available that use one Fibre-Channel RAID storage array between two completely separate (hardware) systems in one chassis. If hardware set 1 fails, hardware set 2 reads in the saved system state and resumes operation. If the software crashes a hardware set, though, the system still goes down -- but then the other hardware set comes up by loading a post-boot image so the downtime is only a few seconds while another clean system comes up.

      ~GoRK

      As far as redundancy in your desktop computer -- well

  17. Re:Is this a new thing or just new to SGI? by DragonHawk · · Score: 2

    This is not a core dump of a running application, but rather, a core dump of the entire running system. If a kernel failure occurs, this patch will dump the contents of system memory to disk, allowing you to analyze system state from just before the crash occurred.

    This would be very useful, for example, when debugging a device driver. It is not something the end-user, or even system administrator, is likely to use. It is for the kernel developer.

    Other OSes (Sun Solaris, SGI IRIX, Novell Netware, to name a few) have had this capability, but Linux has not. Linux has traditionally dumped a summary of the kernel state to the screen, but that is (1) tedious to copy down by hand (which you have to, since the system is dead), and (2) not as complete as an entire system image is.

    --

    dragonhawk@iname.microsoft.com
    I do not like Microsoft. Remove them from my email address.
  18. Re:Netware by Anonymous Coward · · Score: 2

    WinDbg, while probably not redistributable, is a free download from MS. It can read NT kernel dumps. Try here or here. Unfortunately, they're already orienting these places to Win2k.

  19. Re:sun suing by Guy+Harris · · Score: 2
    IRIX machines have been doing this for quite a while....

    IRIX isn't Sun's UNIX, it's SGI's UNIX. They're unlikely to sue themselves for stealing an idea from IRIX....

    BSD, as others have noted, has had it for ages; many other flavors of UNIX probably got the idea (and, in some if not all cases, the code) from BSD.

  20. BSOD for linux by Anonymous Coward · · Score: 0

    Finally, this will make it possible to implement the feature most requested by Windows users: the Blue Screen of Death.

    (P.S. Isn't Linux NOT supposed to crash?)

    1. Re:BSOD for linux by Anonymous Coward · · Score: 0

      Well it was .. but we have decided to let it crash more often so it will be more user-friendly

  21. Re: NT Memory dump by Raetsel · · Score: 2
    Yes, NT has a "Write debugging information to:" option (generally %SystemRoot%\MEMORY.DMP). Then you have a copy of your last-state memory in this massive file. This is not news.

    What jafac was saying is that Microsoft does not give you or offer any low-cost, distributable tools for making sense out of this massive pile of arcane characters.

    Got 128 MB of system memory on an NT workstation? 512 MB on a server? Hope you've got Einstein and a couple years to sort through the thing by hand to find your problem!!

    And UnknownSoldier has another very good point: The analysis tools are not cheap, and you can't share!

    C'mon Microsoft, didn't you learn anything in kindergarten?

    --

    "...America's great minds of today, teaching America's great minds of tomorrow. Poor bastards." -- A Beautiful Min
  22. Core dump on demand? by gorilla · · Score: 1
    One of the nice things I remember going back to my NCR tower days was the ability to take a core dump even when the system hadn't panicked.

    On the NCR tower this was done by toggling a switch to get into the boot ROM, then choosing appropriate options, then the memory would be written to the dump device.

    This helped us diagnose many strange crashes where the system wasn't functioning correctly, but it hadn't actually panicked. For example, one system had init die, which made logging in a bit hard, but the kernel was still running.

    I've missed this feature on more recent PCish hardware, as they don't really have a boot ROM.

    Perhaps someone would like to make a more Linux-biased BIOS, which could include these sorts of nice features.

    1. Re:Core dump on demand? by Anonymous Coward · · Score: 0

      There actually is the OpenBIOS project (www.openbios.org, but that's never been up that I've seen; try http://www.freiburg.linux.de/OpenBIOS/), but I don't think that what you're talking about is something that would be done in BIOS on x86 hardware. AFAIK, the BIOS boots, starts the boot loader, and provides some disk reading functions and the like to the boot loader, which loads the kernel image, which uncompresses itself, bootstraps, and then, as its first act as a running kernel, switches to protected mode and makes the BIOS irrelevant (i.e., never used for anything). This would probably have to be done via a (privileged) system call on Linux. I do agree, though, that this would be a very nice feature for Linux to have... another application would be to snapshot the system, and reload that snapshot at a later date to exactly recreate the system as it was then -- though there are several other technical hurdles to making this possible.

    2. Re:Core dump on demand? by Anonymous Coward · · Score: 0

      Well, FreeBSD has the ability to cause a system panic arbitrarily. Go to the text console and hit control-alt-escape and it will drop into the kernel debugger (DDB). If you have enabled serial debugging you can plug another machine into the test box, and when you send a serial break it will drop to the debugger as well. You can then put it in GDB mode and attach the remote GDB to the test machine.

      Also, when you generate a crash dump in FreeBSD you operate on it as follows:

        cd /var/crash
        gdb -k kernel.1 vmcore.1

      And it's just like using GDB normally, no new program to learn.

      PS: if you want to force a running program to generate a core, send it a SIGABRT (unless it blocks it), or you can run gdb and do 'attach xxx' where xxx is the PID and it will catch it and you can watch it on the fly (VERY handy to find out why your process is sitting in RUN :)

      (Note I said FreeBSD for the above stuff but I am 99% sure it works with NetBSD and OpenBSD too)
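
      The SIGABRT trick is easy to try from a script on any POSIX system. The sketch below uses only the Python standard library: it starts a throwaway child process, sends it SIGABRT, and checks how it died. Whether a core file is actually written depends on local settings such as `ulimit -c`, which this example does not touch.

```python
import signal
import subprocess
import sys

# Start a throwaway child that just sleeps.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

# SIGABRT's default action is to terminate the process and, if core
# dumps are enabled for it, write a core file.
child.send_signal(signal.SIGABRT)
child.wait(timeout=10)

# On POSIX, a negative return code means "killed by that signal number".
killed_by = -child.returncode
print("child terminated by signal", killed_by)
```

      A process that catches, blocks, or ignores SIGABRT will not die this way, which is the caveat noted above.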

    3. Re:Core dump on demand? by Raven667 · · Score: 1

      Hear, hear!

      This shouldn't be too hard to implement, just make a clone of or license someone else's boot prom. Like Apple's FORTH interpreter or something. Start putting this on new PCI only boards, the ones without any serial/parallel/ps2 ports. There backwards compatibility isn't a problem, you only need limited support from some popular OSs (Windows9x is really the only one that uses the BIOS for much of anything). Maybe you can even eschew Win9x compatibility seeing that Win2K, BeOS, Linux, etc would be available at the time.

      Just my $0.02 US.

      --
      -- Remember: Wherever you go, there you are!
  23. Re:Redundant Kernels - I like it... by Anonymous Coward · · Score: 0

    ...reminds me of the ping scheme used for redundant servers;

    1. The backup mirrors the main server and pings it at the same time.

    2. Ping lost? The backup server assumes the IP of the main server and keeps on going.

    3. Administrator alerted, and primary server is fixed, placed in backup server mode. Repeat #1.
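
    The three steps can be sketched as a small state machine in Python. This is a toy model with no real networking: the class name `BackupServer`, the `max_missed` threshold (the post just says "ping lost"), and the IP address are all assumptions for illustration.

```python
class BackupServer:
    """Toy model of the backup's role in the ping scheme."""

    def __init__(self, primary_ip, max_missed=3):
        self.primary_ip = primary_ip
        self.max_missed = max_missed   # assumed threshold, not from the post
        self.missed = 0
        self.active_ip = None          # IP this box currently answers on
        self.alerts = []

    def on_ping_result(self, reply_received):
        if self.active_ip is not None:
            return                     # already took over
        if reply_received:
            self.missed = 0            # step 1: primary is alive, keep mirroring
            return
        self.missed += 1
        if self.missed >= self.max_missed:
            self.active_ip = self.primary_ip                       # step 2
            self.alerts.append("primary down, backup took over")   # step 3


backup = BackupServer("10.0.0.1")
for reply in [True, False, False, False]:
    backup.on_ping_result(reply)
print(backup.active_ip, backup.alerts)
```

    Requiring several missed pings before taking over is a design choice to avoid a takeover on a single dropped packet; two machines answering on the same IP at once is worse than a short delay.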

  24. Re:Is this a new thing or just new to SGI? by Atomic+Frog · · Score: 1

    I don't think so.
    As far as I know, OS/2 has had this for years now.
    While it will tell you to write down the information which it dumps to screen (stupid!), it actually also saves a copy to disk.

    The only time I've had the pleasure of this experience was when I fried my mobo...

  25. No one asked if this used any IRIX or UNIX code... by Alfthemack · · Score: 1

    I hope SGI would be smart enough to do this "clean room." If not, SCO (not Sun, not HP, not IBM, not Compaq, not GNU, not AT&T, not Novell) could sue them. SCO owns the UNIX source code.

    Let me know (of course you will) if I'm wrong on this.


    --
    --Al
  26. if only.. by Thrakkerzog · · Score: 0

    my kernel crashed!

    I guess it's great for development, though..

    1. Re:if only.. by Thrakkerzog · · Score: 1

      Well, chances are this article has nothing to do with your every-day user. I tried booting 2.3.x once, and it was a no-go. (scsi driver failed).

      as for redundant.. well, at the time I posted, it wasn't redundant. ;-)

      and.. NO, I don't sit there waiting for first post.. It just happened to be that way when I checked. I saw an article about kernel panics.. and I thought.. "Well, my kernel has _never_ panicked, this is pretty useless!"

      As I was typing, though.. I figured it would be a great help to kernel hackers.

    2. Re:if only.. by Foogle · · Score: 1
      I think it's funny when the #1 post gets moderated as "Redundant". Although in the case of 1st posters, I guess we've heard it all before.

      -----------

      "You can't shake the Devil's hand and say you're only kidding."

  27. Kernel Panic by locoluis · · Score: 1

    Kernel Panic: Linux Kernel Crash Dump Subsystem received signal 11. giving up.

  28. ... by Signal+11 · · Score: 0

    I wonder how they tested their software.. considering linux crashes so rarely. *rimshot*

    --

    1. Re:... by Anonymous Coward · · Score: 0

      Running bleeding edge development kernels likely. Or introducing a few bugs on purpose :-)

    2. Re:... by yakker · · Score: 2

      We actually ran into very few crashes, so we added code to the kernel to crash it for us with user level commands. The code for crashing the kernel is listed on the LKCD FAQ page: http://oss.sgi.com/projects/lkcd/faq.html

  29. Re:This is great - now for truss by Anonymous Coward · · Score: 0

    Also there is pstack. Is there a linux debugging page? If not there needs to be one.

  30. :-) by Yebyen · · Score: 1

    Sounds like a good idea... although windows could use this more, lol. But seriously, I have noticed some problems: when it freezes (yes, it has happened to me, but rarely), I have no idea why. [coughnetscape].

    If we fail, we will lose the war.
    Had to do it lol

    --
    Restating the obvious since nineteen aught five.
    1. Re::-) by MaxVlast · · Score: 1

      > If we fail, we will lose the war.
      > Had to do it lol

      What's that from?

      --
      Max V.

      --
      There should be a moratorium on the use of the apostrophe.
      Max V.
      NeXTMail/MIME Mail welcome
    2. Re::-) by Anonymous Coward · · Score: 0

      Windows (meaning NT) has had this for years. Good to see Linux is catching up to about 1994.

    3. Re::-) by MaxVlast · · Score: 1

      Never mind. I read down a little further and read the article.

      --
      Max V.

      --
      There should be a moratorium on the use of the apostrophe.
      Max V.
      NeXTMail/MIME Mail welcome
  31. Re:I doubt ext3 will be in Linux 2.4 by axboe · · Score: 1

    I talked to Stephen at the Expo in London, and it is not his intention to push this into 2.3. So unless he (and Linus) changes his mind, it won't be going into 2.3.

  32. SCSI only? by Mr.+Piccolo · · Score: 1

    The description says that it saves the dump to a SCSI partition. What happens if you're running IDE?

    I think the idea is pretty cool -- no more trying to figure out why ksymoops didn't grok what you hastily scribbled down. I suppose all the hardcore kernel hackers will cry "Sacrilege!" though.



    P.S. Sun won't sue for stealing their crash dump idea, right? ;-)

    --
    Glückwünsche, haben Sie Slashdot ermordet, indem Sie zum korporativen Druck beugten und Subskriptionen einlei
    1. Re:SCSI only? by yakker · · Score: 2

      The SCSI partition requirement is only because raw I/O only works for SCSI right now. As soon as an IDE driver works for raw I/O under Linux 2.X.Y, the LKCD project should be very easy to make work with IDE swap partitions. Please review http://oss.sgi.com/projects/lkcd/faq.html for more information about restrictions.

  33. Re:I doubt ext3 will be in Linux 2.4 by axboe · · Score: 1

    Another comment - I know that reiserfs is being submitted for 2.3, but nobody knows whether Linus decides that it goes in at this time. It would be very nice to have for 2.4, though.

    And ext3 is not as simple as you think - although it is "just" ext2 with a journal, you have to consider stuff like write ordering, for instance.

  34. Is this a new thing or just new to SGI? by Namaste · · Score: 1

    Forgive my ignorance, but I'm not too educated on the "under the hood" stuff in *nix environments, but is this a new thing? I've noticed that under Solaris that I've got core files after a crash. Are these the same type of thing or do they not apply to the kernel? If not what are core files for? Do they have any use beyond cluttering up directories?

    1. Re:Is this a new thing or just new to SGI? by chadmulligan · · Score: 1
      >Forgive my ignorance, but I'm not too educated on the "under the hood" stuff in *nix environments, but is this a new thing?

      It's decades old, in fact. When I was debugging patches to DOS/360 and OS/360 (for IBM mainframes) and MCP (for Burroughs mainframes) a core dump, directed to the high-speed printer, was an invaluable tool.

      Once RAM sizes passed the megabyte mark, the effectiveness was much reduced; it was just too much paper to page through. There was enough hardware information (3 extra tag bits for each 48-bit word) to allow the MCP core dump to be formatted into data, code, and stack areas. The IBM dumps were tough going to decode. On a modern microcomputer, of course, lots of other things like page tables and registers have to be printed out, too...

    2. Re:Is this a new thing or just new to SGI? by Mentat21 · · Score: 1

      BSDI has done this for some time. It's actually really handy when you're messing with stuff. Now if it also supported kgdb, that would be even cooler.

    3. Re:Is this a new thing or just new to SGI? by Eric+Smith · · Score: 2
      >Part of the problem with doing it under Linux is where do you dump to? And how do you know the location which the kernel points to for it to dump to isn't corrupted?
      This problem isn't unique to Linux. I don't know what SGI has done (either for Linux or IRIX), but an obvious approach is to use a fixed location on the disk, such as the tail end of logical cylinder 0 (normally unused), or to designate a special "core dump" partition.
      >Most Unices (eg Solaris, Irix, HP/UX) have some hardware support to help with doing this AFAIK,
      Nope. It's all done in software. However, I could imagine that they might possibly have disk drivers with a special core-dump mode that is less dependent on the rest of the kernel, i.e., perhaps it would use polling rather than interrupts. On the other hand, maybe they just assume that the system is working well enough that the disk driver is OK. Often a panic that is caused in some other part of the system won't hurt the disk driver (although the file system code is more delicate, so a kernel dump ideally should bypass the FS).
      >but with the wide variety of x86 hardware, not to mention all the other platforms Linux runs on, that's not a wonderful option in Linux.

      Even on other OSes the core dump doesn't always work. If things get sufficiently screwed up, the system can't write to the disk. But in my experience on other systems, core dumps work most of the time, which makes them quite worthwhile.

      I worked at a router company for five years (and am going to start a new job at another router company on Monday). Our routers could core dump either to floppy or across ethernet to a TFTP server. We found core dumps to be very useful, both during development, and for analyzing failures at customer sites (which we obviously tried hard to avoid).
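
      A rough Python sketch of the TFTP side of a dump-over-ethernet setup like that. The packet layouts follow the standard TFTP write request and data formats (RFC 1350); the file name `router.core` and the helper names are illustrative assumptions, not details of the routers described above.

```python
import struct

OP_WRQ = 2        # write request opcode (RFC 1350)
OP_DATA = 3       # data packet opcode (RFC 1350)
BLOCK_SIZE = 512  # standard TFTP block size


def build_wrq(filename: str, mode: str = "octet") -> bytes:
    """Build a TFTP write request: opcode, filename, NUL, mode, NUL."""
    return (struct.pack("!H", OP_WRQ)
            + filename.encode("ascii") + b"\x00"
            + mode.encode("ascii") + b"\x00")


def build_data_packets(image: bytes):
    """Split a memory image into numbered DATA packets.

    A final block shorter than BLOCK_SIZE (possibly empty) marks the end.
    """
    packets = []
    block = 1
    offset = 0
    while True:
        chunk = image[offset:offset + BLOCK_SIZE]
        packets.append(struct.pack("!HH", OP_DATA, block & 0xFFFF) + chunk)
        if len(chunk) < BLOCK_SIZE:
            break
        offset += BLOCK_SIZE
        block += 1
    return packets


if __name__ == "__main__":
    wrq = build_wrq("router.core")  # hypothetical file name
    fake_image = bytes(1100)        # stand-in for a memory image
    packets = build_data_packets(fake_image)
    print(wrq)
    print(len(packets), "data packets")
```

      A real sender would also wait for an ACK after each packet and retransmit on timeout; that part is left out here.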

      Some of the posters here seemed to question the utility of kernel core dumps, and point out that their kernel doesn't crash. While those people might not need the core dump feature, perhaps they should appreciate that it might help the developers maintain a high standard of quality going forward. As the Linux kernel continues to support an ever-increasing number of device types, expansion busses (such as 1394 and USB), file systems, etc., it will become correspondingly more difficult to keep it robust, and every tool that can be made available to the developers to assist with this should be welcomed.

    4. Re:Is this a new thing or just new to SGI? by Unknwn · · Score: 2

      Part of the problem with doing it under Linux is where do you dump to? And how do you know the location which the kernel points to for it to dump to isn't corrupted? Most Unices (eg Solaris, Irix, HP/UX) have some hardware support to help with doing this AFAIK, but with the wide variety of x86 hardware, not to mention all the other platforms Linux runs on, that's not a wonderful option in Linux.

      With that said, this is a great thing in my opinion... though I haven't tried it yet to see exactly how they implemented it.

      --
      Jeremy Katz

    5. Re:Is this a new thing or just new to SGI? by Zooks! · · Score: 2

      NetBSD can dump a core for the kernel. In fact, so can many other OS's.

      The trick is writing a nice little routine that is solid enough and self-contained enough to dump the memory to disk when the kernel dies.

      Of course this doesn't always work. The exception handler code might be messed up or the disk controller might be in some bad state, but for the most part, kernel exceptions aren't so fatal that they wedge the machine.
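
      To make the idea concrete, here is a toy Python sketch of the save-then-recover cycle: write a small header plus the memory image to a reserved area, then validate the header on the next boot before trusting the image. The magic value, header layout, and function names are invented for illustration; they are not the on-disk format of NetBSD, LKCD, or any other real system.

```python
import io
import struct
import zlib

MAGIC = b"TOYDUMP1"                 # hypothetical marker, not a real format
HEADER = struct.Struct("<8sII64s")  # magic, image length, CRC32, panic string


def write_dump(device, image: bytes, panic: str) -> None:
    """Write header + image at the start of a reserved dump area."""
    header = HEADER.pack(MAGIC, len(image), zlib.crc32(image),
                         panic.encode()[:64])
    device.seek(0)
    device.write(header + image)


def recover_dump(device):
    """Return (panic string, image) if a valid dump is present, else None."""
    device.seek(0)
    raw = device.read(HEADER.size)
    if len(raw) < HEADER.size:
        return None
    magic, length, crc, panic = HEADER.unpack(raw)
    if magic != MAGIC:
        return None  # nothing was dumped, or the header was never written
    image = device.read(length)
    if len(image) != length or zlib.crc32(image) != crc:
        return None  # dump was cut short or corrupted
    return panic.rstrip(b"\x00").decode(), image


if __name__ == "__main__":
    disk = io.BytesIO(bytes(4096))  # stands in for a dump partition
    print(recover_dump(disk))       # no dump yet
    write_dump(disk, b"\xde\xad\xbe\xef" * 16, "User created crash dump")
    panic, image = recover_dump(disk)
    print(panic, len(image))
```

      The checksum is there because the machine writing the dump is, by definition, not healthy, so the reader should not assume the write completed.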

      --

      --

      "I'm too old to use Emacs." -- Rod MacDonald

    6. Re:Is this a new thing or just new to SGI? by Otto · · Score: 3

      I've noticed that under Solaris that I've got core files after a crash.

      As I understand it, the core files (which are not just Solaris, BTW) are a memory dump when an application crashes. I believe that it wasn't possible to do this with a kernel, because the kernel is the guy who is actually writing the core file. I'm probably wrong in specific bits here.

      Anyway, core files can be extremely handy for debugging and such. They're just not very easy to examine. :-)


      ---

      --
      - Give a man a fire and he's warm for a day, but set him on fire and he's warm for the rest of his life.
  35. Re:This is a Good Thing (tm)! by Anonymous Coward · · Score: 0


    why is this a good thing?? it makes no sense. SGI knows nothing about making enterprise class operating systems, but they sure do know a lot about making operating systems crash...

    their hardware sucks, their contributions to linux have been pathetic. alan cox/linus are definitely more adept at coding, OS scalability, and other issues involved with stuff of this nature.

    and of course the *PROVEN* fact that linux does not crash, period, the end. it never has, it never will. solaris, irix, BSD all *WISH* they had the scalability and reliability that Linux has already achieved.

    forget SGI, we don't need their lousy contributions, they are just leeches on the open source revolution.

    LiNuX MaN

  36. Crash Investigation by Microlith · · Score: 1

    This is good... lets you see just what went wrong when the server went down for the first time in 2 years. Should make for finding the bad programs that do bring Linux down.

  37. Whoa! by Anonymous Coward · · Score: 0

    Let me state right up front that I am a dedicated Linux advocate. Having that out of the way I must ask WTF you're smoking? Linux is damn cool but your post is so devoid of reality that it looks like the ravings of a fscking idiot

  38. Cool! by Mark+F.+Komarinski · · Score: 1

    Having worked in a Solaris shop, you can see the value of having crash dumps to send to your vendor.

    Actually, we were the vendor and got crash dumps from customers that let us pinpoint very quickly what the problem was. Once that was found, it was easy to fix. Without the crash dumps, it could take weeks to find the cause of a nasty bug. Especially intermittent ones.

    With Linux having this feature, it'll be easier for driver authors to debug their code, and most likely boost the confidence of customers who want 99.999% uptime.

    --
    -- Ever notice that fast-burning fuse looks exactly the same as slow-burning fuse? I didn't... (Edgar Montrose)
    1. Re:Cool! by Anonymous Coward · · Score: 0

      However, customers who want 99.999% uptime should probably not be running kernel versions with buggy drivers, and indeed should be running tightly tuned kernels with anything not absolutely necessary not enabled (in other words, not even compiled in).

      This will be a useful feature for developers, not post-mortem debugging of production machines.

  39. Re:great idea but... by seebs · · Score: 1

    Then you get a "double panic", a very cryptic message, and no crash dump. Very rare, but it can happen.

    At least, that's how *BSD handles it. "double panic" is engineerese for "fix your broken hardware".

    --
    My blog: http://www.seebs.net/log/ --- My iPhone/iPad app: http://www.seebs.net/seebsfrac/
  40. Re:Uhhhh, This isn't a new thing. by jmv · · Score: 1

    Has anyone said it was a new thing? Linux just lacked a kernel debugger and now it has one. Linux still lacks a journaled file system, and will eventually have one. Nobody's saying it's new, but it's still a reason to be happy.

  41. Memory image will help. by Anonymous Coward · · Score: 0
    Having the memory image will help. What would help more, is to know how the memory image got there!

    We will then know, where it crashed. What is more important, is why it crashed! By the time it crashed, the real cause of the crash could be 10 layers away.

    Any more information will help though.

    What would be better, is not crashing. :) Linux crashing? Is that a Windows emulation feature?

    Injured software engineer wins against Mattel!

    1. Re:Memory image will help. by yakker · · Score: 4
      I figured I'd include an example of a crash dump analysis report that is created in /var/log/vmdump. This is what you'll get after a kernel panic (or something similar to it). You can also run 'lcrash' on the map/vmdump files and perform interactive analysis, such as a 'dump' of memory, or a 'dis'assemble of some instructions, etc. Sorry, the spacing's not going to look exactly right ...



      =======================
      LCRASH CORE FILE REPORT
      =======================

      GENERATED ON:
      Thu Nov 4 19:15:19 1999


      TIME OF CRASH:
      Fri Nov 5 03:12:27 1999


      PANIC STRING:
      User created crash dump

      MAP:
      map.5

      VMDUMP:
      vmdump.5

      ================
      COREFILE SUMMARY
      ================

      The system died due to a software failure.

      ===================
      UTSNAME INFORMATION
      ===================

      sysname : Linux
      nodename : peak-pc.engr.sgi.com
      release : 2.2.13
      version : #1 SMP Fri Nov 5 02:59:34 PST 1999
      machine : i686
      domainname : engr.sgi.com

      ===============
      LOG BUFFER DUMP
      ===============

      Linux version 2.2.13 (root@peak-pc.engr.sgi.com) (gcc version egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)) #1 SMP Fri Nov 5 02:59:34 PST 1999
      mapped APIC to ffffe000 (0026f000)
      mapped IOAPIC to ffffd000 (00270000)
      Detected 348940216 Hz processor.
      Console: colour VGA+ 80x25
      Calibrating delay loop... 348.16 BogoMIPS
      Memory: 95448k/98304k available (1100k kernel code, 424k reserved, 1268k data, 64k init)
      Checking 386/387 coupling... OK, FPU using exception 16 error reporting.
      Checking 'hlt' instruction... OK.
      POSIX conformance testing by UNIFIX
      per-CPU timeslice cutoff: 100.26 usecs.
      CPU0: Intel Pentium II (Deschutes) stepping 02
      SMP motherboard not detected. Using dummy APIC emulation.
      PCI: PCI BIOS revision 2.10 entry at 0xfcaee
      PCI: Using configuration type 1
      PCI: Probing PCI hardware
      Linux NET4.0 for Linux 2.2
      Based upon Swansea University Computer Society NET3.039
      NET4: Unix domain sockets 1.0 for Linux NET4.0.
      NET4: Linux TCP/IP 1.0 for NET4.0
      IP Protocols: ICMP, UDP, TCP
      Starting kswapd v 1.5
      Detected PS/2 Mouse Port.
      Serial driver version 4.27 with no serial options enabled
      ttyS00 at 0x03f8 (irq = 4) is a 16550A
      ttyS01 at 0x02f8 (irq = 3) is a 16550A
      pty: 256 Unix98 ptys configured
      PIIX4: IDE controller on PCI bus 00 dev 39
      PIIX4: not 100% native mode: will probe irqs later
      ide0: BM-DMA at 0xffa0-0xffa7, BIOS settings: hda:DMA, hdb:pio
      ide1: BM-DMA at 0xffa8-0xffaf, BIOS settings: hdc:DMA, hdd:pio
      hda: WDC AC24300L, ATA DISK drive
      hdc: NEC CD-ROM DRIVE:28C, ATAPI CDROM drive
      ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
      ide1 at 0x170-0x177,0x376 on irq 15
      hda: WDC AC24300L, 4112MB w/256kB Cache, CHS=524/255/63, UDMA
      hdc: ATAPI 32X CD-ROM drive, 128kB Cache
      Uniform CDROM driver Revision: 2.56
      Floppy drive(s): fd0 is 1.44M
      FDC 0 is a National Semiconductor PC87306
      (scsi0) found at PCI 14/0
      (scsi0) Narrow Channel, SCSI ID=7, 3/255 SCBs
      (scsi0) Warning - detected auto-termination
      (scsi0) Please verify driver detected settings are correct.
      (scsi0) If not, then please properly set the device termination
      (scsi0) in the Adaptec SCSI BIOS by hitting CTRL-A when prompted
      (scsi0) during machine bootup.
      (scsi0) Cables present (Int-50 YES, Ext-50 NO)
      (scsi0) Downloading sequencer code... 413 instructions downloaded
      scsi0 : Adaptec AHA274x/284x/294x (EISA/VLB/PCI-Fast SCSI) 5.1.20/3.2.4

      scsi : 1 host.
      (scsi0:0:6:0) Synchronous at 20.0 Mbyte/sec, offset 15.
      Vendor: IBM Model: DDRS-34560 Rev: S97B
      Type: Direct-Access ANSI SCSI revision: 02
      Detected scsi disk sda at scsi0, channel 0, id 6, lun 0
      scsi : detected 1 SCSI disk total.
      SCSI device sda: hdwr sector= 512 bytes. Sectors= 8925000 [4357 MB] [4.4 GB]
      3c59x.c:v0.99H 11/17/98 Donald Becker http://cesdis.gsfc.nasa.gov/linux/drivers/vortex.html
      eth0: 3Com 3c905B Cyclone 100baseTx at 0xdc00, 00:c0:4f:90:6e:54, IRQ 11
      8K byte-wide RAM 5:3 Rx:Tx split, autoselect/Autonegotiate interface.
      MII transceiver found at address 24, status 786d.
      MII transceiver found at address 0, status 786d.
      Enabling bus-master transmits and whole-frame receives.
      Partition check:
      sda: sda1 sda2 sda3
      hda: hda1 hda2
      VFS: Mounted root (ext2 filesystem) readonly.
      Freeing unused kernel memory: 64k freed
      EXT2-fs warning: mounting unchecked fs, running e2fsck is recommended
      EXT2-fs warning: mounting unchecked fs, running e2fsck is recommended
      dump_open(): dump device opened: 0x803 [sd(8,3)]
      Adding Swap: 130748k swap-space (priority -1)
      Adding Swap: 130748k swap-space (priority -2)
      Kernel panic: User created crash dump
      Dumping to device 0x803 [sd(8,3)] ...
      Writing dump header ...
      Writing dump pages ...

      ====================
      CURRENT SYSTEM TASKS
      ====================

      ADDR UID PID PPID STATE PRI FLAGS MM NAME
      ==============================================================================
      c0234000 0 0 0 0 0 0 c0215320 swapper
      c5ffa000 0 1 0 1 20 100 c5fb4060 init
      c5fe8000 0 2 1 1 20 40 c0215320 kflushd
      c5fe6000 0 3 1 1 20 40 c0215320 kupdate
      c5fe4000 0 4 1 1 20 840 c0215320 kpiod
      c5fe2000 0 5 1 1 20 840 c0215320 kswapd
      c59ec000 1 248 1 1 20 140 c5fb4260 portmap
      c5686000 0 263 1 1 20 140 c5fb4460 ypbind
      c578c000 0 270 263 1 20 140 c5fb44e0 ypbind
      c5644000 0 324 1 1 20 140 c5fb42e0 syslogd
      c5602000 0 335 1 1 20 140 c5fb43e0 klogd
      c55c4000 0 349 1 1 20 40 c5fb4560 atd
      c5c3c000 0 363 1 1 20 40 c5fb41e0 crond
      c55a2000 0 381 1 1 20 140 c5fb45e0 inetd
      c5518000 0 395 1 1 20 140 c5fb4660 snmpd
      c5348000 0 409 1 1 20 40 c5fb46e0 named
      c52fe000 0 423 1 1 20 140 c5fb4760 routed
      c5272000 0 437 1 1 20 140 c5fb47e0 xntpd
      c523e000 0 451 1 1 20 140 c5fb4860 lpd
      c51e4000 0 469 1 1 20 140 c5fb48e0 rpc.statd
      c5194000 0 480 1 1 20 40 c5fb4960 rpc.rquotad
      c5174000 0 491 1 1 20 40 c5fb49e0 rpc.mountd
      c5158000 0 515 1 1 20 140 c5fb4ae0 rpc.rstatd
      c513e000 0 529 1 1 20 140 c5fb4a60 rpc.rusersd
      c511e000 99 543 1 1 20 40 c5fb4b60 rpc.rwalld
      c50f6000 0 557 1 1 20 140 c5fb4be0 rwhod
      c513c000 0 577 1 1 20 140 c5fb4360 rpc.yppasswdd
      c5078000 0 589 1 1 20 140 c5fb4ce0 amd
      c5086000 0 591 1 1 20 40 c0215320 rpciod
      c504a000 0 592 1 1 20 40 c0215320 lockd
      c4f54000 0 626 1 1 20 140 c5fb4de0 sendmail
      c4f22000 0 641 1 1 20 140 c5fb4d60 gpm
      c4e12000 0 655 1 1 20 140 c5fb4e60 httpd
      c4e0a000 99 658 655 1 20 140 c5fb4ee0 httpd
      c4e06000 99 659 655 1 20 140 c5fb4f60 httpd
      c4dfc000 99 660 655 1 20 140 c4dfe040 httpd
      c4df2000 99 661 655 1 20 140 c4dfe0c0 httpd
      c4de8000 99 662 655 1 20 140 c4dfe140 httpd
      c4dde000 99 663 655 1 20 140 c4dfe1c0 httpd
      c4dd4000 99 664 655 1 20 140 c4dfe240 httpd
      c4dcc000 99 665 655 1 20 140 c4dfe2c0 httpd
      c4dc0000 99 666 655 1 20 140 c4dfe340 httpd
      c4db6000 99 667 655 1 20 140 c4dfe3c0 httpd
      c4a28000 0 699 1 1 20 140 c4dfe540 smbd
      c499e000 0 710 1 1 20 140 c4dfe4c0 nmbd
      c4658000 9 767 1 1 20 40 c4dfe840 actived
      c48a2000 0 806 1 1 20 100 c5fb4c60 mingetty
      c4928000 0 807 1 1 20 100 c5fb4160 mingetty
      c4904000 0 808 1 1 20 100 c4dfe7c0 mingetty
      c498a000 0 809 1 1 20 100 c4dfe440 mingetty
      c4766000 0 810 1 1 20 100 c4dfe5c0 mingetty
      c4c6e000 0 811 1 1 20 100 c4dfe640 mingetty
      c479e000 0 812 1 1 20 100 c4dfe8c0 getty
      c4798000 0 817 381 1 20 100 c5fb40e0 in.rlogind
      c4976000 0 818 817 1 20 100 c4dfe740 login
      c45b8000 0 819 818 1 20 100 c4dfe6c0 tcsh
      c5204000 0 838 819 0 20 0 c4dfe940 crashdump

      ===========================
      STACK TRACE OF FAILING TASK
      ===========================

      ================================================================
      STACK TRACE FOR TASK: 0xc5204000 (crashdump)

      0 __dump_execute+153 [0xc010da21]
      1 dump_execute+149 [0xc011b925]
      2 panic+167 [0xc0114b6f]
      3 sys_setpriority+25 [0xc0115689]
      4 system_call+45 [0xc0107a61]
      ================================================================
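
      For anyone who wants to post-process a report like this, a short Python sketch that pulls the frames out of the stack trace section. The regular expression and the `parse_frames` name are assumptions based on the line layout shown above; they are not part of lcrash, and treating the offset as decimal is a guess.

```python
import re

# Matches lines such as: "0 __dump_execute+153 [0xc010da21]"
FRAME_RE = re.compile(r"^\s*(\d+)\s+(\w+)\+(\d+)\s+\[0x([0-9a-fA-F]+)\]\s*$")


def parse_frames(report: str):
    """Return a list of (frame number, symbol, offset, address) tuples."""
    frames = []
    for line in report.splitlines():
        match = FRAME_RE.match(line)
        if match:
            number, symbol, offset, address = match.groups()
            frames.append((int(number), symbol, int(offset), int(address, 16)))
    return frames


SAMPLE = """\
0 __dump_execute+153 [0xc010da21]
1 dump_execute+149 [0xc011b925]
2 panic+167 [0xc0114b6f]
3 sys_setpriority+25 [0xc0115689]
4 system_call+45 [0xc0107a61]
"""

if __name__ == "__main__":
    for number, symbol, offset, address in parse_frames(SAMPLE):
        # Assuming a decimal offset, address - offset is the symbol's start.
        print(f"#{number} {symbol:<16} starts near {address - offset:#x}")
```

      Read bottom-up, the sample shows the path from system_call through sys_setpriority and panic into the dump routines.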

  43. Re:Do we really need this? by mangino · · Score: 1

    I don't normally have problems with crashes either. Currently, however, I am working on kernel modules for Solaris. Until I learned how to use adb on the kernel crash dump, debugging was impossible. Now it is relatively easy: just run adb -k unix.0 vmcore.0, and $c will show you the call stack. This works great for debugging kernel level drivers and modules. I can't wait to try this under Linux!
    --
    Mike Mangino Consultant, Analysts International

    --
    Mike Mangino
    mmangino@acm.org
  44. This is great - now for truss by Anonymous Coward · · Score: 0

    I think this is a great detail that's needed for enterprise stuff. What about the Sun truss command?

    1. Re:This is great - now for truss by JatTDB · · Score: 1

      Dunno about you, but my FreeBSD boxes all have truss...if Linux doesn't have a version of truss written for it yet, somebody definitely needs to work on it.

      --
      "That's Tron. He fights for the Users."
    2. Re:This is great - now for truss by Anonymous Coward · · Score: 0

      The "strace" command will do what you need in Linux right now.

    3. Re:This is great - now for truss by Anonymous Coward · · Score: 0

      Thanks for the info!! -Nathan

  45. Re:WOW DOOD by Anonymous Coward · · Score: 0

    Rather amusingly (and as you can see from a few dozen posts up above), NT actually does have this ability.

  46. Re:This is a Good Thing (tm)! by hugui · · Score: 1

    Hey AC, why don't you get your facts right ?

    1 - the SGI hardware is amazing. They make some of the finest machines available ( Octane 2000, O2 ), and they achieve a level of parallelism that Linux still dreams about ( in an always undervalued Irix machine )

    2 - their contributions to Linux have been very good and well received. KDBG is a very useful tool, and coupled with LKCD will make kernel and driver development a lot easier. GLX will be used in XFree86 4.0, they're working on the Linux for Merced port, etc.

    3 - OS scalability: have you ever heard of IRIX and its support of more than 64 CPUs ?

    4 - PROVEN fact that Linux never crashes is bullshit. It's just an OS like any other that can also crash, in particular during development releases. I'm doing multiprocessor research on it and today I made it crash twice. I like Linux very much, but I try to keep my eyes open.

    Finally, what have you done for Linux lately ? SGI has been supporting Linux constantly during this last year, and they don't deserve to be treated this way ( remember, there are hard-working people there who contribute their code under the GPL )

    Enough, you don't even deserve my time answering your stupid post. Go back to your perl scripts ( or was it VB ? ).



  47. Clang clang clang goes the trolley by Anonymous Coward · · Score: 0



    Whoa! This is great! I'm so happy for Microsoft. It's about time the government let them break up, instead of forcing them to remain a monopoly.

    I just hope this means my phone bill will go down.

  48. Re:Netware by Anonymous Coward · · Score: 1

    The resource kit includes tools to interpret the core dump and regurgitate the BSOD contents (which, BTW, almost always points to a video driver file). If that isn't good enough go over to www.sysinternals.com where there is a utility that saves the screen contents specifically.

  49. Re:I'm missing something... by Anonymous Coward · · Score: 0

    There are "checked/debug" versions of all MS OSes, including 2000, that any MS developer with an MSDN subscription gets. This includes the entire symbol table, etc. Additionally, the BSODs are virtually ALWAYS a third party driver running in protected mode that decided to take down the party, meaning the information is critically valuable.

  50. Re:WOW DOOD by Anonymous Coward · · Score: 0

    Ironically your "NT user" impression leaves you looking far more like your average slackjawed Linux yokel looking for some friends by joining the cult.

  51. Here's one situation where it wouldn't help.... by Skratch · · Score: 1

    I've got this one computer at home that crashes EVERY time it hits runlevel 0. I think it's got something to do with apmd, but anyways, it's one of those computers that can turn themselves off through software. Normally, the last thing you see is:

    Stopping all md devices.
    System is halted.
    Power down.

    And at that point, you either have to turn it off yourself, or the software (apmd?) does it for you. Well this one box I have (a crappy HP I got for free) gets right to the words "power down" and then it dumps all sorts of crap onto the screen, including the values in the CPU's registers, and what I assume to be some crap from memory. What I'm thinking here though, is that since all the filesystems are already unmounted, LKCD wouldn't make a lick of difference for me. Am I right in assuming this?

    --

    -- My neighbors dog has a four inch clit.
    1. Re:Here's one situation where it wouldn't help.... by kcarnold · · Score: 1

      If you got problems with APM, recompile your kernel and disable APM support. (It can't be apmd because init killed that a long time before.) If your APM is otherwise working fine, just disable the "Power Down on Halt" option. Then your computer will be a little quieter. You might want to remount everything read-only (mount -o remount,ro) and try 'apm -s' and 'apm -S' to see if your APM is messed up. Could be BIOS settings.

      I should stop trying to be a troubleshooter... well, I don't think LKCD would make any difference in this situation because you don't care. If it's a problem that happens when your computer should be dormant anyway, why bother trying to figure out what's wrong? Especially when there are better ways to go about doing that (fiddling with BIOS settings and such).

      And what are md devices, anyway? And why should I care? It magically disappeared when I got a new kernel, so I don't care anymore.

      Kenneth

    2. Re:Here's one situation where it wouldn't help.... by Skratch · · Score: 1

      Well, the only reason I care is that it doesn't work correctly. It's a problem, perhaps in the kernel, that, while insignificant, is still a problem. I'd like to see Linux be as nifty as it could be. I can see the kind of treatment people would give windows if it blue screened right after blinking "It's now safe to turn off your crappy winbox".

      --

      -- My neighbors dog has a four inch clit.
    3. Re:Here's one situation where it wouldn't help.... by Anonymous Coward · · Score: 0

      It (on IRIX at least) writes to the swap partition, not a filesystem. So it *should* work for you.

    4. Re:Here's one situation where it wouldn't help.... by yakker · · Score: 1

      Since the LKCD product dumps out to a SCSI swap partition, it doesn't matter if it is mounted or not. As long as SCSI interrupts still work, you should be able to create a crash dump.


      I'd be very interested to see the dump ...


      --Matt

  52. Welcome the 20th century! by Anonymous Coward · · Score: 0
    Oh, wow! Linux, the operating system of the Gods, now has something that *BSD and every commercial vendor have had since the beginning of time.

    Way to go, guys! Welcome to the 20th century!

    Perhaps you'll even eventually make it to the 21st.

  53. Re:Uhhhh, This isn't a new thing. by Chaostrophy · · Score: 1

    At least two journaled file systems will be in 2.4. Reiserfs and ext3 should both make it in. Possibly XFS as well (not heard any news about that one). Hey, and 64GB max memory in 2.4 as well (still 4GB max per process though).

    --
    Plato seems wrong to me today
  54. RTFM by _damnit_ · · Score: 1

    " Memory dump files are created when a STOP error occurs, and the system is set to save debug information in the 'Startup/Shutdown' tab of the 'System' Control Panel."
    source: support.microsoft.com


    This is a real easy one to set up. The feature's not usually used on small workgroup servers because there's usually no one around who can do anything with a 256MB binary. I was going to say a lot of nasty things about dumb NT admins, but I thought I'd be nice as I was one (and will be again if the money's right).
    It's better to be uninformed than misinformed.


    _damnit_

    --


    _damnit_

    It's my job to freeze you. -- Logan's Run
    1. Re:RTFM by jafac · · Score: 2

      For me it's not the size of the binary file - I was already used to 256 - 512 Meg binary memory dumps of Netware (I guess now they have some sort of selective tool for NW 5 that makes the dumps much smaller).

      What others have pointed out, and what I was saying is that, on my side of the phone, that does me NO good. Unless my customer has MS Visual Studio on it (which, by the way, usually screws up the delicate mix of MFC dlls and causes problems of its own), this file is useless.

      On Netware, you could drop into the debugger, even on a production file server, check a few pages of the stack, and the registers, and jump back, and quite often, not greatly interrupt service (if you did it within the timeout period of the Netware Clients). Not possible on NT.
      You learned ONE tool, console debugger, and it was the SAME interface and commands as the tool that examined the coredump files on your DOS machine at your leisure. I quite often used to talk customers through debugger sessions on the phone to gather information. Even when the customer was totally non-technical. This is not possible for NT, because if you're lucky, and the customer HAS a debugger, it most likely isn't one you're familiar with. And if the customer is non-technical (MUCH more likely on NT than Novell), again, you're SOL.

      On NT, your ONLY option, 99% of the time, is to get the memory file transported to you (FTP or whatever), and send it to a programmer who has the time and the very expensive software to debug it. With Novell, ME, a non-programmer, a tech support guy, without a costly subscription to MSDN, without a costly copy of MS Visual Studio or SoftICE, could quickly and cheaply debug problems, compare the call stack to other incidents to see if it's a similar problem, or distill the pertinent information down to a paragraph or two and email it to a developer for debugging or suggestions; and if that wasn't enough info for the developer, I could THEN resort to getting the whole file, or trying to reproduce the problem.

      The thing is, the integrated debugger solution gave support some granularity in how much resources were devoted to a problem. Now, we either have to be equipped like a developer (COSTLY!), or we have to forward MOST cases to a developer (COSTLY, and ugly!).

      I know that Microsoft's reason for this was that an integral debugger compromises security (in theory, you could look up user-data with it that you wouldn't otherwise have rights to).
      IMO, this is totally lame, because if the administrator was worried about security, the debugger could be disabled and locked out. And for cases where a debugger was needed, the administrator could go into the user setup, and check the box that enables the debugger.
      The real reason was probably so they could increase the value of MS Visual Studio, and ask a higher price. Before NT came out, a debugger was commonly considered a necessity of life, and I can't think of one OS (other than Windows95) that didn't ship with a debugger free of charge (even DOS had DEBUG.COM).

      I wish I had a nickel for every time someone said "Information wants to be free".

      --

      These are my friends, See how they glisten. See this one shine, how he smiles in the light.
    2. Re:RTFM by Anonymous Coward · · Score: 0

      You really just miss DDT on your good old CP/M machine.

      Why don't you go back and use that?

  55. wrong about NT by _damnit_ · · Score: 1

    " Memory dump files are created when a STOP error occurs, and the system is set to save debug information in the 'Startup/Shutdown' tab of the 'System' Control Panel."
    source: support.microsoft.com


    _damnit_

    --


    _damnit_

    It's my job to freeze you. -- Logan's Run
  56. bad idea by LocalYokel · · Score: 1

    That wouldn't really solve the problem -- obviously, someone has a cronjob that runs a script to scan for new /. stories.

    Naturally, there are people and things we'd rather not deal with, but just like IRL, it's unavoidable. If that thought is too traumatic to deal with, you have two options:

    1. Set your threshold to > 1
    2. Keep your threshold to < 2, and just wait for the article to collect ~30 comments.

    Any kind of censorship (including an IP ban) is bad, bad, bad -- but what do I know? I'm just as much a part of the problem as anyone else.

    --

    --
    E2 IN2 IE?

  57. Oops tracing is fun! by Kaz+Kylheku · · Score: 2

    A good hacker should be able to do with just a register dump, stack trace and some program text surrounding the instruction pointer where things went belly up.

    Hacking the kernel is supposed to be hard and tracing crashes given minimal information is a big part of the fun and attraction of ``iron man'' programming.

    Then again, having a full dump doesn't necessarily make debugging that much easier. It's an incremental improvement over oops text.

    Here is the real advantage: a dump is good from the point of view of users who need to report crashes to developers. I think that even a hack to get oops text (rather than a full dump) written to a partition would be better than asking the poor user to copy the oops text appearing on the frozen console down on a piece of paper! Forget it!

    1. Re:Oops tracing is fun! by Salamander · · Score: 1

      >Hacking the kernel is supposed to be hard and tracing crashes given minimal information is a big part of the fun and attraction of ``iron man'' programming

      That's fine if your goal is to compensate for deficiencies elsewhere in your life by making yourself feel like an "iron man" programmer. If your goal is instead to produce working quality kernel code, you eventually ask yourself "why make an intrinsically difficult task even more difficult by not using the best tools available"?

      If self-flagellation or self-denial are good things, might as well go all the way, right? Go build your own computer...from sand and copper ore, using no tools but those you make yourself. Come back when you're done.

      --
      Slashdot - News for Herds. Stuff that Splatters.
    2. Re:Oops tracing is fun! by seebs · · Score: 1

      Oops tracing may be fun, but if you're on any kind of schedule, or you're doing commercial support, crash dumps are *well* worth it.

      --
      My blog: http://www.seebs.net/log/ --- My iPhone/iPad app: http://www.seebs.net/seebsfrac/
  58. Yeah, but debugging is a real pain by Owen+Lynn · · Score: 1

    You need TWO PeeCees hooked together by serial
    port. Then you put one computer in "debug boot
    mode", and control the debugger using the other.

    Feh.

    On Solaris, you just grab the core and symbol
    files, and use adb. On just one computer, with
    no special boot modes, with the machine running
    whatever.

    Having this ability on Linux will be very very
    convenient.

  59. Re:This would be GREAT if the Linux kernel crashed by Xtacy · · Score: 1

    yup once, accessing a floppy drive, dont know why, dont want to know why cuz it's never happened since :)

  60. Is this new? by Anonymous Coward · · Score: 0

    It seems like a fancier and easier to use method for dealing with kernel "oops" files. The kernel source tree has always had instructions on how to debug a kernel crash.

    What else is different about this new SGI stuff?

  61. Re:This would be GREAT if the Linux kernel crashed by Anonymous Coward · · Score: 0

    Never seen a panic, but I did once see an Oops. I thought that was funny, the documentation said oopses were all but nonexistent. If I knew what did it (the box was a server and I didn't see the oops text until a week later) and could reproduce it, I'd have reported it.

  62. Re:'bad programs' by Eric+Smith · · Score: 2
    And if a problem in a userland program causes the kernel to crash, not only is the userland program possibly broken, but the kernel is definitely broken. This core dump feature will help debug such problems.

    I've personally never seen a userland program crash the Linux kernel. The closest I've come is having bugs in the X server lock up the keyboard and display, but the machine was still running fine in all other regards, and I was able to telnet in and initiate a clean reboot.

  63. Re:This implementation is much less than what BSD by Eric+Smith · · Score: 2
    This is *crazy*! That's like, uhm, like sort of a hack perpetrated by someone who was in a hurry and didn't know about prior art.
    Hey, give them a break. This is just the first release; if people like it and encourage them (or work on it themselves), it will undoubtedly get better with time. After all, ROM wasn't built in a day. :-)
  64. Re: Flamebait by krieg · · Score: 1

    Couldn't agree more, we have a farm of O2s and they crash like hell, by Unix standard that is; almost one crash per month. We had some code that would systematically crash machines with panic error. I just hope that what SGI has to contribute to Linux is not instability!!!!

  65. Try 'strace' by polarbear · · Score: 1

    strace -p pid

    Hope this helps...

    --
    --- polarbear
  66. Great things from SGI... by Anonymous Coward · · Score: 0

    The best part isn't this particular feature (although it's a good thing). The important point to take away from this is that companies with a vested interest in particular markets are contributing to Linux.

    SGI has a lot of expertise in building enterprise-class software and you can bet that there's more good stuff to come. Corel is doing interesting work on the user interface and will probably contribute lots of neat stuff to Debian. These are companies that would never collaborate directly on a product, but through Linux they end up contributing to each other and to the market as a whole.

    We live in interesting times...

  67. They aren't there quite yet!! by Anonymous Coward · · Score: 0

    This software does not allow a debugger to operate on a dump file; instead they introduced a new program (lcrash) which allows the user to "interact" with the image. And this new program must be recompiled every time the kernel changes. Say hello to version skew! If it took Linux this long to get to this point, who knows how long before they'll actually be able to use a real debugger.

  68. NT does this by Anonymous Coward · · Score: 0

    NT has had this for years, where have you been? Hiding under the Unix rock? Seriously, if more Unix people would actually try NT, they'd realize there is no need for Unix anymore :-)

    1. Re:NT does this by poopie · · Score: 1

      >NT has had this for years, where have you been?
      running mission critical apps on unix with fantastic uptime.

      >Hiding under the Unix rock?
      if that means remotely administering and monitoring hundreds of multiuser servers from your office, then ... yes.

      >if more Unix people would actually try NT...
      sorry, we have, we did, and finally our managers have stopped making us try to deploy on it... I developed a deathly serious allergic reaction to BSOD and rebooting, and I'm not going to try windows again until it includes a decent shell like bash, ksh, tcsh and an xserver.

    2. Re:NT does this by Anonymous Coward · · Score: 0

      sorry, we have, we did, and finally our managers have stopped making us try to deploy on it...

      So did you take a cut in pay at that point in time with your demotion?


      I'm not going to try windows again until it includes a decent shell like bash, ksh, tcsh and an xserver.

      http://www.interix.com

    3. Re:NT does this by poopie · · Score: 1

      >So did you take a cut in pay..
      no the managers got re-orged or got the boot.

      >http://www.interix.com
      first off, interix or exceed aren't free, and by the time I get exceed approved for 20 users, I could have upgraded them to linux.

      all the interix stuff makes it painfully clear that Micros~1 never intended to be interoperable with unix, and didn't want file or app sharing. Jeez. look at how ultimately feeble the standard windows telnet is. Look at the lack of scripting in command.com and cmd.exe...

      I'm pretty underwhelmed with interix and most PC xservers to be honest.. I'd be happy if the windows desktop supported x apps transparently and I could "xhost +backup1 server2 mail4 dev344" from a dos shell, telnet to a unix server from that dos shell without spawning a "telnet" windows app, set my display back and launch an app. -- oh, but that's exactly the level of interoperability that Micros~1 never wanted to achieve. Actually, I want to be able to use a nis/nis+ server for my NT logins without a hack.

      When Micros~1 find someone who can implement all that functionality in a shipping version of windows as standard features, I'll consider revising my opinion.

      xservers on windows: like putting lipstick on a pig.

  69. This is GREAT news! by dsaxena · · Score: 1

    As someone who has been doing Linux device driver development for about a year and gotten annoyed at the lack of kernel development tools, it's really nice to see this. Now if only Linus would make Andrea Arcangeli's Integrated Kernel Debugger a part of the standard tree, it would make my day.
    --
    This comment is (©) Copyright Deepak Saxena.

    --
    Deepak Saxena
    "Computers are useless, they can only give you answers" - Picasso
  70. Linux kernel tools need work by tilly · · Score: 2

    This is a good thing, but it is part of a more general problem.

    And that problem is that we accept tools for Linux development that are distinctly sub par. There is a lot that could, and should, be done.

    I would say more, but I cannot possibly say it better than this rant does.

    Cheers,
    Ben

    PS The Microsoft program works right and has a bad interface, the Linux program has a nice interface but sucks! Whodathunkit? (Read the link.)

    --
    My usual seat in the cluetrain is at http://pub4.ezboard.com/biwethey.ht
  71. Not JUST a core dumper by grumpy_geek · · Score: 2

    Having done just a very quick glance over the specs I may be wrong, but I believe they are doing what they have been doing on the SGI for a while. When an SGI running a newer flavor of IRIX does a system panic (SCSI, memory, whatever) it dumps a core out. Dumping this file is not for the drivespace weak: if you have half a gig of RAM you have a half a gig core file. But the beauty of this is that it then automatically examines the core file and tries to figure out what killed it, so you don't have to go in and run the debugger yourself.

    Having the machine tell you what memory page you were at when it took a dive makes life much nicer for the harried admin; of course if you want to dig through a core at a later time with your debugger you can, but it gives you a good starting point, and tends to make tracking things down much quicker since you have a guess as to where the problem resides. Having your box tell you that you had a memory error in SIMM 3 bringing the box down, having analysed the core file before you even have a chance to fire up your debugger, is a pretty nice thing.

    Of course this is dependent upon my assumption that it works in the same kind of fashion as IRIX (which it seems to).

  72. Re:Live debugging by Anonymous Coward · · Score: 0

    There are kernel debuggers for the Linux kernel, Linus just doesn't include them in the standard kernel.

    one is ftp://e-mind.com/pub/andrea/kernel-patches/ikd/
    and another one is from SGI: http://oss.sgi.com/projects/kdb/

  73. Re:sun suing by Anonymous Coward · · Score: 0

    Duh, I'm sure he knows IRIX is SGI's. He's saying SGI already had it, so Sun can't sue.

  74. Live debugging by Anonymous Coward · · Score: 0

    Linux will probably be a "modern" OS when you can debug a live system (either remotely or via an in-kernel debugger). This is just one aspect where Linux shows its (lack of) maturity. *BSD is so much nicer to develop for.

    1. Re:Live debugging by tob · · Score: 1
      You want to be able to debug a running linux system remotely? Would running gdb remotely through a serial line work for you? Nobody is claiming (well, at least I'm not :) that linux is perfect. It's just that it's moving towards greatness with incredible speed.

      http://oss.sgi.com/projects/kgdb/

  75. Re:I'm missing something... by Anonymous Coward · · Score: 0

    >What exactly are you supposed to do with a kernel core dump under a closed source OS?
    You're supposed to put on aluminum foil suits and dance around.

  76. Re:Netware by Anonymous Coward · · Score: 0

    One of the most recent times NT crashed on me (I am not running it at present) the BSOD (which, you might be interested to know, is a debugger's dump screen) contained enough information to debug what was the problem.

    That isn't always the case, of course, and there isn't a real easy way to record the info on a BSOD screen (a large 'scope camera?)

  77. Re: NT Memory dump by Anonymous Coward · · Score: 1

    Actually in the support directory on the CDs are both the kernel symbols and the i386 kd. Now there's not a lot of documentation on this on the CD, but if you buy the book Inside Windows NT you will get an introduction to the kernel debugger. (Also some of this information is in the Device Driver Kit.) If you are supporting NT, you need at least the Platform SDK and the DDK.

  78. Don't bother with NT for serious work by Morgaine · · Score: 2

    But a fat lot of good a kernel debugger does you on a closed-source OS.

    NT had the future almost in its grasp, but let it slip away by being impossibly unreliable and horribly admin-unfriendly compared to any Unix product. (We worked with it for a year but eventually had to discard it as a worthless toy.)

    But that was then. Now it's just plain obsolete. Face it.

    --
    "The question of whether machines can think is no more interesting than [] whether submarines can swim" - Dijkstra
  79. Re: NT Memory dump by Anonymous Coward · · Score: 0

    Microsoft learned something in Kindergarten. The teacher said "You have to bring enough for everybody" so it stopped bringing anything interesting for show-and-tell.

    In third grade, Microsoft learned that the teacher who said "I am holding you back because the rest of the class needs to catch up" really was serious.

    In seventh grade it learned in "social studies" class that "we owe it to those less fortunate than ourselves to share what we have."

    By eighth grade it realized the teachers were just not very bright people. It read in a book somewhere (off campus) the maxim "If you can't do it, teach it."

  80. This may be new to Linux, but... by Anonymous Coward · · Score: 0
    ...this is NOT a new thing. Other UNIX OSes have had this for YEARS. Before taking my current job, I spent ten years doing kernel development, on four different OSes. Every single one of them could do this. I'm shocked that Linux doesn't. It didn't even occur to me that it didn't.

    Uh....there's a kernel debugger for Linux, isn't there? Please say yes.....

  81. Re: Just new to Linux by Anonymous Coward · · Score: 0
    Other OSes have been able to do this for years. I worked in kernel development for ten years, and every one of the *nix kernels I worked on could do this. It would dump the kernel image to the swap space, and when the system came back up it would transfer the images from swap to the filesystem.

    I'm really surprised Linux didn't already have this. I haven't done any kernel stuff for Linux, so it never occurred to me that it didn't already exist.

    There's a kernel debugger, right? Hit a special key sequence and the whole system stops, so you can look around at data structures? Is there? Someone please say yes...
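
    For anyone who hasn't seen the swap/savecore dance described above, here is a rough sketch of the flow. It is only an illustration under stated assumptions: the header layout, the magic value and the names panic_dump and savecore are made up for this example (savecore borrows the name of the real BSD command but not its code), and an ordinary file stands in for the swap partition.

```python
import os
import struct
import tempfile

# Hypothetical header: 8-byte magic, then the image length as an unsigned 64-bit int.
MAGIC = b"KDUMP\x00\x00\x01"
HEADER = struct.Struct("<8sQ")


def panic_dump(swap_path, memory_image):
    """At crash time: write a header plus the raw memory image into the swap area."""
    with open(swap_path, "r+b") as swap:
        swap.write(HEADER.pack(MAGIC, len(memory_image)))
        swap.write(memory_image)


def savecore(swap_path, dump_dir):
    """At next boot: if swap holds a valid dump, copy it out and invalidate it."""
    with open(swap_path, "r+b") as swap:
        magic, length = HEADER.unpack(swap.read(HEADER.size))
        if magic != MAGIC:
            return None  # no dump pending
        image = swap.read(length)
        out_path = os.path.join(dump_dir, "vmcore.0")
        with open(out_path, "wb") as out:
            out.write(image)
        # Clear the header so the same dump is not saved again on the next boot.
        swap.seek(0)
        swap.write(b"\x00" * HEADER.size)
    return out_path


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        swap_path = os.path.join(tmp, "swap")
        with open(swap_path, "wb") as f:
            f.write(b"\x00" * 4096)  # stand-in for a swap partition
        first = savecore(swap_path, tmp)  # nothing to save yet
        panic_dump(swap_path, b"pretend kernel memory")
        saved = savecore(swap_path, tmp)
        with open(saved, "rb") as f:
            contents = f.read()
        again = savecore(swap_path, tmp)  # already saved, so nothing to do
        return first, os.path.basename(saved), contents, again
```

    The point of going through swap is that the dying kernel only has to write raw blocks to a known location; the filesystem work is left to a healthy system after reboot.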

  82. Isnt Linus Against this? by Anonymous Coward · · Score: 0

    IIRC, Linus has been against this whole idea for a long time, and I think I agree with him. The resource crunching is simply not worth the feature. Oops tracing is just as good. I don't think this should be in the kernel, but maybe a separate patch maintained by SGI.

    1. Re:Isnt Linus Against this? by yakker · · Score: 1

      How is this "resource crunching"? I agree Linus may have a problem with having this in the kernel, but certainly not because it consumes resources.


      There's only one resource it currently "consumes", and that's about 64K of dedicated memory. We're looking to eliminate that requirement in the next release.


      I guess the real test will be asking Linus ...


      --Matt

  83. Re:'bad programs' by Anonymous Coward · · Score: 0

    "I've personally never seen a userland program crash the Linux kernel."

    Hah - The very first day I used Linux (RedHat 6.0) the Gimp rebooted my machine twice! That was some two months ago by now, but had I had that crash dump program I might have contributed in a more informative way than merely posting my ignorant rantings.

  84. Re:This is a Good Thing (tm)! by Anonymous Coward · · Score: 0

    can you say "flamebait"?
    -dilinger, who forgot his slashdot passwd 2 years ago

  85. NT did this already by Anonymous Coward · · Score: 0

    Where was everyone's brain on this? NT has done this for a long time, the dump can be found in the page file.

    Finally a way to find out why X is always crashing...

    --- Anonymous Coward

    Console apps blow!

  86. Re:sun suing by Abigail-II · · Score: 1
    And there are some non-Unix OSses that do that as well. In a previous job, we had a Netapp machine. If it crashed, you could dump out the kernel - which was cool because then you could ftp 80 Mb (or something like that) of core dumps to Netapps, and they would have a look. Which sometimes resulted in kernel patches a couple of weeks later.

    -- Abigail

  87. Re:This would be GREAT if the Linux kernel crashed by DGolden · · Score: 1

    Maybe you're an unusually strong source of gamma radiation :-)

    --
    Choice of masters is not freedom.
  88. wrong again by _damnit_ · · Score: 1

    NT saves it in swap as long as the swap partition is on the same drive as the boot partition. Again, RTFM. The serial port option is only an option and is NOT the default or usual manner for memory dumps.
    As someone else in this thread commented, the dump is of the entire contents of memory. This changes in Win2000, but I have not personally seen this.


    _damnit_

    --


    _damnit_

    It's my job to freeze you. -- Logan's Run
  89. Re:I want my BSOD! by bugg · · Score: 1

    Look into the KDE screensaver.

    the BSOD is alive and well on my FreeBSD desktop.

    --
    -bugg
  90. This implementation is much less than what BSD has by Anonymous Coward · · Score: 0

    Saving the memory image is where the similarity ends. In *BSD,Solaris, NT, etc. you also have a *real* debugger to operate on the crash dump with. In linux, you have a half-ass tool that lets you inspect structures and needs to be recompiled EVERY TIME YOUR KERNEL CHANGES. It's a step forward for Linux, but it's still extremely primitive when compared to the kernel development environment on other OS's.
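
    To see why a tool with kernel structure layouts compiled into it would suffer from version skew, consider the sketch below. It is an assumption about the general failure mode, not a description of how lcrash actually works; the two layouts and the read_task helper are invented for illustration.

```python
import struct

# Hypothetical layouts of the same kernel structure in two kernel builds.
# Build B adds a 4-byte 'flags' field before 'pid', shifting everything after it.
LAYOUT_A = struct.Struct("<I16s")   # pid, comm
LAYOUT_B = struct.Struct("<II16s")  # flags, pid, comm


def read_task(dump, layout):
    """Decode (pid, comm) from the start of a dump using a fixed layout."""
    fields = layout.unpack_from(dump, 0)
    pid, comm = fields[-2], fields[-1]
    return pid, comm.rstrip(b"\x00").decode("ascii", errors="replace")


# A dump written by a kernel that uses layout B.
dump_from_b = LAYOUT_B.pack(0x1, 4242, b"gimp")

matching = read_task(dump_from_b, LAYOUT_B)  # tool built against the same kernel
stale = read_task(dump_from_b, LAYOUT_A)     # tool built against the older kernel
```

    The matching tool recovers pid 4242 and the name gimp, while the stale one reads the flags word as the pid and garbage as the name. A debugger that takes its type information from the kernel's own symbol data avoids this class of problem.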

  91. Re:Netware by jafac · · Score: 1

    hm. If I was granted moderator points more often than once every 9 months, I'd moderate you up.

    This is JUST what I'm looking for, well, it answers most of my complaints. Unfortunately, it does seem to be W2K only, not NT 4, which is likely to represent 99% of my install base for well past 12 months. (I seriously doubt that there will be any significant migration to W2K until this time next year. Oh, there will be a few early adopters, but among MY customers - almost nobody plans on putting it into production).

    I wish I had a nickel for every time someone said "Information wants to be free".

    --

    These are my friends, See how they glisten. See this one shine, how he smiles in the light.
  92. WOW DOOD by Anonymous Coward · · Score: 0

    Im Glad to see that this is on LINUX because EVERYONNE nos that LINUX is total bugy! Now you can SEE Y! i know why my NT dont need ths STUF because it Never crashes1!!1 Thats y al hte other NIXs got THIS! because THEY suck 2!1! NT ROOLS YOU LINUX SHTS! AND BSD SUX!

    1. Re:WOW DOOD by Anonymous Coward · · Score: 0

      Well ......... I once had a friend who talked like that I repeat I once HAD a friend who talked like that. So... you really should stop talking like that

    2. Re:WOW DOOD by Anonymous Coward · · Score: 0

      Nice try, Linus.

      Go to your room. There will be no PowerPoint at all for you tonight!

    3. Re:WOW DOOD by Anonymous Coward · · Score: 0

      Well I'm the AC that made the post above with the NT ROOLS in it. Wanted to play flamebait with the moderators.

      Do you have any IDEA how hard it is to mangle language THAT much? God..

      Scary thing, it's still Score:0

      heh..

  93. Softick is a free debugger for NT by Anonymous Coward · · Score: 1

    http://softick.8m.com/

    Possibly others would work as well. Check out http://www.suddendischarge.com/debuggers.html for just about every (free/shareware) debugger ever made.

  94. Re:Uhhhh, This isn't a new thing. by Anonymous Coward · · Score: 0

    However, Linux is wrapped around a poor non-standard TCP/IP stack, which won't change anytime soon.

  95. Re:I doubt ext3 will be in Linux 2.4 by Chaostrophy · · Score: 1

    Well, ext3 is fairly simple: on disk it is ext2 with a log file (itself a regular file). Reiserfs is coming along. What goes in may not be feature complete, but the word on the reiserfs mailing list is that they both will make it into 2.4, and should be in 2.3 fairly soon. Perhaps they will be flagged experimental in the early 2.4 kernels.
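
    For readers wondering what a log file buys you, here is a toy write-ahead log. It is a sketch of the general journaling idea only and makes no claim about ext3's real on-disk format; the class and its method names are invented for this example.

```python
class JournaledStore:
    """Toy block store with a write-ahead log (not ext3's real format)."""

    def __init__(self):
        self.blocks = {}   # stand-in for the main filesystem blocks
        self.journal = []  # stand-in for the log file

    def log(self, updates):
        # Record the intended block updates before touching the main blocks.
        self.journal.append(("data", dict(updates)))

    def commit(self):
        # Mark the most recently logged updates as complete.
        self.journal.append(("commit", None))

    def replay(self):
        # After a crash: apply only the updates that were followed by a commit.
        pending = None
        for kind, payload in self.journal:
            if kind == "data":
                pending = payload
            elif kind == "commit" and pending is not None:
                self.blocks.update(pending)
                pending = None
        self.journal.clear()


def demo():
    fs = JournaledStore()
    fs.log({1: "inode v2", 7: "bitmap v2"})
    fs.commit()
    fs.log({1: "inode v3"})  # simulated crash: this update never gets committed
    fs.replay()
    return fs.blocks
```

    After replay the store holds the committed "v2" blocks and the half-finished "v3" update is dropped, which is why a journaled filesystem can skip a full fsck after a crash.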

    --
    Plato seems wrong to me today
  96. Re:I'm missing something... by DrJolt · · Score: 1

    > What exactly are you supposed to do with a kernel core dump under a closed source OS?

    Figure out what application was running when your system hung, tell your support provider, and get them to fix it.

  97. This is a Good Thing (tm)! by javac · · Score: 1

    This is great.

    I think SGI is going to do more for Linux than most people expect. They are helping us move into the Enterprise so much faster than I ever thought possible. You should look at their web page and see all the code they have contributed, it is very nice. SGI may be struggling, but they have a large cash reserve, and are staking their existence on Linux.

    I hope they succeed and will personally see that I get as many SGI servers around here as possible

    geach

    1. Re:This is a Good Thing (tm)! by doubleyou · · Score: 1

      They make some of the finest machines available ( Octane 2000, O2 ),

      Perhaps you meant "Origin 2000", or just "Octane". :)

      Catcha' later,
      Paul.

  98. Re:FIRST POST!!! by Mr.+Piccolo · · Score: 1

    Sorry to disappoint you, but not only are you not first, but the previous _six_ posts (a new record for Slashdot!) are not lame "First Post" posts.

    --
    Glückwünsche, haben Sie Slashdot ermordet, indem Sie zum korporativen Druck beugten und Subskriptionen einlei
  99. Re:FIRST POST!!! by Anonymous Coward · · Score: 0

    http://slashdot.org/users.pl?op=userinfo&nick=meta wronka

    wow..

  100. Flamebait by Anonymous Coward · · Score: 0

    Those SGI guys are experts on crashing the kernel. Maybe that's why they came up with the XFS file system! At least if you're going to crash every week, you don't have to fsck every time.

    1. Re:Flamebait by Anonymous Coward · · Score: 0

      Actually those that are most expert at crashing the kernel, the database, and the customer's customer are the fine folks from Sun. Just look at eBay, the Encyclopaedia Britannica, etc. What SGI does better than anyone out there is to build high performance engines capable of work that Sun only has wet dreams of ever doing.

      If you want to post flame bait, at least make it real. IRIX is quite stable. Certainly more so than Slowlaris. Anyone try the "64-bit" version of Slowlaris yet? You sneeze and it falls over. You fart on the way home and your pager goes off because it died.

      If you want to beat up on SGI, do it for their pathetic PR/marketing. It is so friggin hard to sell their stuff to my management, cause all the PHB-VPs see are "dot com me baby", and don't actually have to work with the stuff they drop on our heads. Well, I guess that's because Sun is working their butts off to be "just like Bill's company"(tm).

      SGI has the best hardware/software on the planet. They just don't have the stomach to sell it or tell people about it. They make my life hard. They deserve a few kicks in the teeth for that. They probably need to lose their top marketing and sales people in order to fix their problems.

      But Sun... geez. Talk about stability... Too bad that e10k website went down, the reality of an E10000 and the marketing verbiage just don't seem to jibe for most customers.

  101. Re:sun suing by Anonymous Coward · · Score: 0

    The NetApp is a BSD machine. NetBSD if I'm not mistaking. Heavily modified of course, but deep beneath there's *BSD in the beast.

  102. XFS ?!? by Anonymous Coward · · Score: 0

    NOTICE: Starting XFS recovery on filesystem: / (dev: 0/79)

    Hey you guys have XFS working on Linux already? Sweet! Can't wait to see it on my box someday soon. Excellent work guys...

    1. Re:XFS ?!? by doubleyou · · Score: 1


      Take a closer look at that icrash report, dude. The test system was running IRIX, not Linux. Sorry to disappoint.

      Catcha' later,
      Paul.

  103. Netware by jafac · · Score: 3

    I think that this was one of the greatest features of Novell - the fact that if your server was barfing, you could go into the debugger, and neuter an offending process; or if the server was really in trouble, it would drop into the debugger, so you could at least figure out what went wrong, or dump the memory image and send it to someone who could.

    And also, it's one of the things I really, really, really HATE about NT. No debugger comes with the OS, and there's no free, distributable one out there, so from a tech support standpoint, if your customer's server barfs, you kind of have to guess at what went wrong, or establish a pattern from multiple calls, or try to reproduce it in-house. Switching from supporting Netware products to NT products has been hell, and this is 90% of the reason. This kind of thing in Linux can only help "the cause". (and because my company is working on some fairly significant Linux products, and I may end up supporting them, this makes me more optimistic about the future.)

    I wish I had a nickel for every time someone said "Information wants to be free".

    --

    These are my friends, See how they glisten. See this one shine, how he smiles in the light.
    1. Re:Netware by Ed+Avis · · Score: 2

      Doesn't NT come with that useless 'Dr Watson' thing that, whenever something crashes, wastes your time with an unkillable dialogue box? Surely it must do _something_ useful - what is this 'application error log' that it keeps trying to create?

      Also, Netscape license a thing called Full Circle that sends information back to Netscape HQ following a Navigator crash.

      --
      -- Ed Avis ed@membled.com
    2. Re:Netware by Quikah · · Score: 1

      NT has a memory dump on crash feature.

      --
      Q.
  104. What does it do? by Anonymous Coward · · Score: 0

    Hmmm, I thought Linux never crashed? What does this dump analyzer do?

    1. Re:What does it do? by Anonymous Coward · · Score: 0

      I think it saves info on what other OSes would have crashed on and points out how Linux saved the user from going down and ending a 2000-day uptime.

  105. Re:FIRST POST!!! by liNA-seven-nine · · Score: 1

    this guy is probably running a script to post 'FIRST POST!!!' every time. or just plain _cOUgH_eGGhEad_cOUgH_.
    --

    --
    You're a cartoon of rebel! You're all like exaggerated version of yourself! - Gerard Jones
  106. Re:No one asked if this used any IRIX or UNIX code by Anonymous Coward · · Score: 0

    Why don't you write them a letter? Better yet, why not drop by SGI and volunteer to ``help them out''. I'm sure a start-up like SGI could use volunteer ``experts'' like yourself. Maybe you could offer to do all their legal footwork.

  107. Do you have an Epox motherboard? Update your BIOS. by Anonymous Coward · · Score: 0

    My machine did this exact same thing ... it had something to do with a buggy BIOS, so if you're using an Epox motherboard you might want to see this page: http://www.epox.com/support/bios.html

  108. Good to hear, but that goes without saying. by Chip+Stillmore · · Score: 1

    Being able to view a dump of the memory at the state it was in when the crash occurred is an invaluable piece of data for any developer, and/or support-type person.

    It makes the PTF/patch process go so much easier. Of course, that's when the stack in the dump hasn't been corrupted by "unexpected" behaviour. Then, all bets are off.

    BTW, does anyone know if they have any tools tailored to viewing these dumps and being able to quickly navigate through the stack, popping/pushing when needed? That would be nice too, but I never noticed anything from the announcement, and I don't have a Linux system handy at the moment to check the package.

  109. Re:This implementation is much less than what BSD by T-Punkt · · Score: 1

    > After all, ROM wasn't built in a day. :-)
    Which ROM? 8)

    I have the strong feeling that some ROMs of my hardware were coded in less than a single day...

  110. Re:sun suing by Guy+Harris · · Score: 2
    The NetApp is a BSD machine. NetBSD if I'm not mistaking.

    You are mistaken. NetApp boxes do *N*O*T* run any flavor of BSD as their OS.

    The underlying kernel is one written at NetApp; it doesn't support multiple address spaces, any notion of userland, or demand paging (heck, until recently, it didn't even change the page tables; it now uses the paging hardware, but only to make virtually contiguous physically-discontiguous pages, to make allocation of large chunks of memory a bit less painful).

    A significant part of the code did come from BSD - the networking stack came from BSD (4.4-Lite, with some bits of the FreeBSD and NetBSD stacks thrown in), as did many of the commands (although those had to be chainsawed a bit to run in kernel mode in a shared address space), as did the dump and restore code (although the dump code was significantly changed to work with our WAFL file system). Various support routines also came from various BSDs, and the NFS server code is somewhat remotely derived from the BSD code (although it was also significantly changed to fit into our environment).

    However, that doesn't mean NetApp boxes run anything you'd recognize as "BSD" (and, in particular, the crash-dumping code isn't BSD-derived, although the savecore command is based on the BSD command, although, again, significantly modified to run in our environment, and to extract the core dump information from the core dump areas on the disks).

    (Yes, I know this first hand. I'm one of the developers there, and have been since early 1994.)

  111. what about a meta kernel? by Anonymous Coward · · Score: 0

    Meta

  112. Re:This implementation is much less than what BSD by seebs · · Score: 1

    You must be joking.

    Hey, GUYS!

    THIS IS A SYSTEM BASED ON GNU TOOLS!

    YOU HAVE A DEBUGGER WHICH HAS SPECIAL HOOKS FOR DEBUGGING KERNEL CORE DUMPS!

    This is *crazy*! That's like, uhm, like sort of a hack perpetrated by someone who was in a hurry and didn't know about prior art.

    Which, I guess, is the allegation that Linux always faces from people. *sigh*. Oh, well, it'll get better. :)

    --
    My blog: http://www.seebs.net/log/ --- My iPhone/iPad app: http://www.seebs.net/seebsfrac/
  113. NT already has this by josepha48 · · Score: 1
    NT already has this capability in it. When NT dies it takes the memory contents and dumps them to its swap space. Then you are supposedly able to debug the BSOD. I never tried this, but it is what M$ says it can do. It is good to see that Linux is hopefully going to have this too. Since this involves kernel code I wonder if Linus will let it in.

    send flames > /dev/null

    --

    Only 'flamers' flame!

  114. I doubt ext3 will be in Linux 2.4 by cpeterso · · Score: 2

    The most recent Linux 2.3.25 kernel does not have ext3. ext3 is still way alpha. Linux 2.3 is already under feature freeze. If Linus plans to release Linux 2.4 by 2000 Q1, I doubt ext3 will be part of it.

  115. I'm missing something... by roystgnr · · Score: 3

    People here are saying that yes, even NT has the ability to dump kernel core when it BSODs, but:

    What exactly are you supposed to do with a kernel core dump under a closed source OS? Throw a printout of it into a bonfire to propitiate the Windows Demons? Send it to Microsoft and wait for their rigorous QA process to leap into action and send you a fixed kernel? I can't imagine trying to debug it yourself without being able to get a backtrace and look at the problem source code. Does Microsoft even leave a symbol table of internal function names in the NT kernel? What exactly do you do with a Kernel Debugger in Solaris if you can't see anything more than what a disassembler will tell you about the kernel being debugged?

    1. Re:I'm missing something... by Anonymous Coward · · Score: 0

      When NT crashes (not if) and one gets the dump file, one can examine the dump file using the flaky (but available) windbg/kd tools and the symbols. Since we no longer have to worry about NT on alpha systems, one doesn't have to remember all the different calling conventions (well, not quite yet). I must say that my x86 assembly has improved these last few weeks working on this stuff. There are a few cases where you won't be able to get a dump file on NT4 (and one is back to hoping one can repro the situation on a different system, but hopes usually fall short). I'm only thankful that when I worked on NT alpha systems management didn't bother to take my unix alpha box away, because it was a turbochannel system and could not run NT.

    2. Re:I'm missing something... by Lightstorm · · Score: 1
      > Figure out what application was running when your system hung, tell your support provider, and get them to fix it.

      If the system goes down, it's because of a problem with the OS (poorly behaved applications shouldn't be able to crash a robust operating system, unless running as root).

      So whilst it's interesting to find out which application exposed the bug, you still need enough clout to make your closed source OS vendor do something about it...

    3. Re:I'm missing something... by EasyTarget · · Score: 1

      What exactly are you supposed to do with a kernel core dump under a closed source OS?

      - Not much if running at home, but commercial customers get much, much, better support than home-jo can imagine, even from the Great (Ms)atan. They pay thousands of $K, per year, per node for the privilege. This includes 'send us the kernel dump and we'll tell you what happened. If it is in the OS or hardware we'll fix it. If not we'll tell you exactly what the app did.'
      HP have made a kernel change because of a defect my team discovered. We paid 70k$ per year for 40 nodes. It's not perfect support, but you got generally good service.

      Throw a printout of it into a bonfire to propitiate the Windows Demons?

      - Nah, too inflammable, you wind the DLT tape of it round a stick like entrails.

      Send it to Microsoft and wait for their rigorous QA process to leap into action and send you a fixed kernel?

      - Again, they might if you are Motorola, HP, etc. Or a big OEM like Dell and want a fix so you can use cheap and nasty motherboards.

      I can't imagine trying to debug it yourself without being able to get a backtrace and look at the problem source code.

      - In your office, take an HPUX crashdump, use Xdev and extract the panic message, even if a system halt occurred. I did it 3 weeks ago. It let me confirm that a series of panics (HPUX saves multiple dumps) all had a common cause.

      Does Microsoft even leave a symbol table of internal function names in the NT kernel?

      - Yeah, NSAkey NSAlock NSAsniff.. I can see it now.

      What exactly do you do with a Kernel Debugger in Solaris if you can't see anything more than what a disassembler will tell you about the kernel being debugged?

      - You can see the exact state. And a decent support contract will give you a working system and a fix.

      - Look, this is of far more use to industry. When the web server panics for some unknown reason, they want to know -exactly why- so they can stop it happening again, and they can afford to pay for the analysis.
      - This is actually quite important for the future of Linux in the commercial sector. I would bet that HP, Sun, MS et al. have used this as a small argument when 'competing' against Linux for large and prestigious orders.

      --
      "Oops, I always forget the purpose of competition is to divide people into winners and losers." - Hobbes
  116. Re:great idea but... by yakker · · Score: 1
    The intention of the code is to avoid this as much as possible. The only real dependency is on the SCSI controller interrupts still working.


    One other point to note: there are a number of features still to add to reduce the size of the memory image and to speed up the dumping process. We're also looking to make NMIs work properly under Linux. Stay tuned.

    --Matt

  117. Could you pull that off with Vmware? by Greyfox · · Score: 2

    I was just thinking the other day that running Linux in a VM in Linux would be handy for, among other things, somewhat more secure network services (Mail, web services, etc.) Run your server in a virtual machine and who cares if it gets cracked? Just wipe it clean and reinstall (Once you get it the way you like it, you could write the image to CD and just restore from CD every time you reboot.)

    --

    I'm trying to teach myself to set people on fire with my mind... Is it hot in here?

  118. Not new, but still very useful. by Yelskwah · · Score: 2

    The core files you're seeing save the segment of memory in which the program was running. They can be used in conjunction with a debugger and image with debugging information to recreate the state of the application when it crashed, enabling the programmer to glean information about which instruction caused the crash.

    Dumping the kernel on a crash is not new but it is useful, in much the same way.

    Under HP-UX, as far as I remember, when the kernel crashes it is dumped into the swap device starting backwards from the end of swap. One of the first actions of the boot sequence (and boy can that take a long time) is to check whether there is a kernel image written in swap. If so, it's copied out and can be sent back to the kernel team for investigation.

    Of course, if your boot sequence doesn't copy out the kernel, you've got a finite time to get it out yourself before it's overwritten by the ever-advancing swap data.
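
    The dump-into-swap scheme described above can be sketched roughly as follows. This is only an illustration, not the real HP-UX on-disk format: the `KDUMP` marker, the 4-byte length field and both function names are hypothetical, and the swap device is modelled as an in-memory byte array.

```python
from typing import Optional

MAGIC = b"KDUMP"  # hypothetical marker, not a real HP-UX format


def write_dump_at_end(swap: bytearray, image: bytes) -> None:
    """At crash time: place the image, its length and a marker at the tail of swap."""
    record = image + len(image).to_bytes(4, "big") + MAGIC
    if len(record) > len(swap):
        raise ValueError("swap area too small to hold the dump")
    # Same-length slice assignment, so the size of the swap area is unchanged.
    swap[len(swap) - len(record):] = record


def recover_dump(swap: bytearray) -> Optional[bytes]:
    """Early in boot: copy the image out if the marker is present, else return None."""
    marker_start = len(swap) - len(MAGIC)
    if bytes(swap[marker_start:]) != MAGIC:
        return None
    length = int.from_bytes(swap[marker_start - 4:marker_start], "big")
    image_start = marker_start - 4 - length
    if image_start < 0:
        return None
    image = bytes(swap[image_start:marker_start - 4])
    # Clear the marker so the same dump is not recovered on the next boot.
    swap[marker_start:] = bytes(len(MAGIC))
    return image


swap_area = bytearray(64)
write_dump_at_end(swap_area, b"panic: demo")
print(recover_dump(swap_area))
```

    Putting the marker at the very end means the boot code only has to look in one fixed place, while ordinary swap data growing from the front is the last thing to reach the dump.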

    -John

  119. Redundant kernels? by seebs · · Score: 1

    This makes very little sense. The purpose of the kernel is to be the thing which allocates resources. If it screwed up, how do you "recover"?

    Do both kernels have access to the serial ports? Does only one? If it's only one, how do you guess what state the port is in when it dies and the other takes over? If it's both, how do you keep them from conflicting?

    It turns out that the program which keeps the kernels from conflicting is, in fact, the kernel.

    This mind is not buddha.

    --
    My blog: http://www.seebs.net/log/ --- My iPhone/iPad app: http://www.seebs.net/seebsfrac/
  120. Kernel Debug bug bug by the+eric+conspiracy · · Score: 2

    I think that this is excellent news. And more so for Linux than any other system.

    It is crucially important that a community project like Linux have good debugging tools, both from the perspective of quality control, and to encourage others to get involved in the community.

    Other systems that are open but don't actively encourage contributions, or worse yet are closed - well, these debuggers are useful in the sense that they help pinpoint a problem. But in many cases you don't have control of the source code, so there isn't much you can do except mail it to the developers. If they even have a place to mail it to.

  121. Not all UNIX systems have a huge /proc by Jeff+Mahoney · · Score: 1

    In many commercial UNIX systems, the proc filesystem isn't as broad as that of Linux, and the low-level tuning parameters are configured using the kernel debugger. In some cases, the kernel debugger is the general program debugger with the ability to traverse /dev/kmem, plus some functions to manipulate the appropriate data structures.

    Debuggers aren't only used for debugging.

  122. This would be GREAT if the Linux kernel crashed... by Anonymous Coward · · Score: 1

    But as it is, what's the point? Has anyone ever actually SEEN a kernel panic?

  123. It's intended for kernel developers by EngrBohn · · Score: 3

    So if you play with a x.(2y+1).z kernel while rubbing your feet on the carpet and a lightning rod attached to an ISA slot, then this is for you. If you only use a x.(2y).z kernel with z>2, then this'll probably do nothing more than occupy disk space.
    Christopher A. Bohn
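
    The x.(2y+1).z versus x.(2y).z distinction above is the odd/even minor-number convention used for kernels of this era, and it can be checked mechanically. A minimal sketch, with hypothetical function names; the "z>2" threshold is taken from the comment above, not from any official rule, and `2.2.13` is just an illustrative version string.

```python
def is_development_kernel(version: str) -> bool:
    """True when the minor number is odd, e.g. 2.1.132."""
    _major, minor, _patch = version.split(".")
    return int(minor) % 2 == 1


def is_settled_stable_kernel(version: str) -> bool:
    """True for an even minor number with a patch level above 2 (the poster's rule of thumb)."""
    _major, minor, patch = version.split(".")
    return int(minor) % 2 == 0 and int(patch) > 2


for v in ("2.0.35", "2.1.132", "2.2.13"):
    kind = "development" if is_development_kernel(v) else "stable"
    print(v, kind)
```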

    --
    cb
    Oooh! What does this button do!?
  124. Uhhhh, This isn't a new thing. by bifrost · · Score: 2

    Just about every other OS I know of (except for NT) includes this. Having a Kernel Debugger, Kernel Core Dump, and a few other tools available over the past few *YEARS* has saved me a lot of hassle. If Linux hasn't had this till now, I'm sooooooooooo sorry. That's really disappointing.
    *BSD, Solaris, Dynix, and bazillions of other OSes have had this ever since they were created.

    1. Re:Uhhhh, This isn't a new thing. by Anonymous Coward · · Score: 0

      NT has had it from day one.

  125. I like it... by Fish+Man · · Score: 1

    This sounds like just about the coolest software utility that I will almost never have to use!

    A Windows version would be orders of magnitude more useful!

  126. great idea but... by eries · · Score: 1

    What happens when the LKCDA crashes during a system crash? Who recovers from that??

  127. I want my BSOD! by Gottjager · · Score: 1


    Forget knowing why my system crashed, I just want my BSOD. They better have someone working on bringing that to Linux...

  128. I like it... by Fish+Man · · Score: 1

    This sounds like just about the coolest software utility that I will almost never have to use! [Grin]

    A Windows version would be orders of magnitude more useful! [Bigger grin!]

  129. Core and SCO by Anonymous Coward · · Score: 0

    This is really cool, and could save people lots of time and energy... At one place I worked that was using SCO, we were having problems with something causing massive core dumps, so we sent the dump off to SCO and a few days later they sent us back a message describing exactly what was wrong with the system. It was something like, "On line 10697 you'll see four dollar signs, that's a problem with the network card..." It's good to see something like that for Linux now.

  130. Do we really need this? by Greyfox · · Score: 2
    I don't think I've EVER had my kernel crash on me...

    OK, I could see as how it might help the developers... ;-)

    --

    I'm trying to teach myself to set people on fire with my mind... Is it hot in here?

  131. Device driver writers rejoice! by Stiletto · · Score: 1

    This will be tremendously useful for us device driver writers, and all other breeds of kernel-hackers. True, Linux rarely crashes under normal use, but when your code is running with the kernel and you make a mistake... OOPS!

  132. Re:This implementation is much less than what BSD by Eric+Smith · · Score: 1
    Even simple microcontroller-based devices that I've personally hacked together very quickly usually end up taking more than a day. Only the most trivial ones were finished in a period of hours.

    For some examples, see my PIC page.

  133. Re: Just new to Linux by doubleyou · · Score: 1

    There's a kernel debugger, right? Hit a special key sequence and the whole system stops, so you can look around at data structures? Is there? Someone please say yes...

    I'm sorry, but this made me laugh out loud.

    I too am curious if there is some sort of kdb under Linux. This sort of thing was available for SCO when I was doing driver maintenance for it at my last job (but I barely knew how to use a kernel debugger at that time). I've been sysadmining IRIX at my current job, and I've been looking for a kernel debugger and the key sequence to get into it, but haven't found it yet. (Admittedly, I haven't been looking too hard, but I'm curious nonetheless.)

    Catcha' later,
    Paul.

  134. Closed Source and Crash Dumps by seebs · · Score: 1

    Well, as a data point, BSD/OS (www.bsdi.com) is arguably "closed source", and gives out crash dumps.

    I'm in the support group, and we find crash dumps *very* valuable. It is not necessary for the customer to necessarily have all the source, just a kernel with known characteristics...

    --
    My blog: http://www.seebs.net/log/ --- My iPhone/iPad app: http://www.seebs.net/seebsfrac/
  135. Yep, a 2.0.35 kernel crashed on me by quade]CnM[ · · Score: 1

    Then again, I pulled the IDE cable out of the drive when it was running. I bet that didn't help the situation. Other than that, a 2.1.132 kernel mysteriously locked up on me, but that was a development kernel, and the machine had a 70+ day uptime. Also, I tried one of the latest 2.3.x kernels, and it crashed on me detecting the USB controller, but then I expected that :)

  136. Check your memory and fans by Anonymous Coward · · Score: 0

    I had repeated sig 11 crashes whenever I compiled something or did something CPU-intensive like a find | filename. I took the RAM speed down in my BIOS, and after that I took my case off and the problem vanished.

  137. Re:FIRST POST!!! by Anonymous Coward · · Score: 0

    I'd support a block on your IP address if I thought it'd make you quit whining.

  138. Re:Uhhhh, This isn't a new thing.. 30 years ago... by Anonymous Coward · · Score: 0

    That's right, IBM/Sperry/Burroughs/GE/ICL had it, to catch real bugs and hardware errors. Of course, it was written raw to disk. But now it's written to memory, after checking for defined events (known PSW values) and taking defined actions (ignore, dump, kill x, or user code), plus IPCS to read the dump afterwards. The dump program formats plenty, and between stacks and pointer chains, you have a fair idea of whodidit. Of course you have SADUMP, so you can read the dump without the OS, or boot a mini system. All good stuff, except MVS is now 99.999% reliable. I don't think MS has a parmlib, with a list of parameters about what user-definable actions to take. IBM also has SLIP traps, a kinda SoftICE (but hardware ICE). You set a trap or event, and get a dump when it occurs. For this reason, no bug survives on MVS.

  139. Re:Just like NT 4.0 by Anonymous Coward · · Score: 0

    The serial hookup trick sounds like debugging an Amiga circa 1990.

  140. Re:This would be GREAT if the Linux kernel crashed by TummyX · · Score: 1

    Uh, yes at least 4 times in the past 6 months.

  141. I can even imagine it... by Kinthelt · · Score: 1
    "Hello, this is Microsoft tech support, how may I help you?"

    "Oh, you BSODed? Did you install Service Pack 5?"

    "You did. Well, what applications were you running?"

    "Just a Q3Test server? Okay, send me the core dump and I'll check it out." (yeah right!)

    Few minutes go by

    "Okay, we've thoroughly analysed the data. It seems that NT crashed."

    "Oh, you know it crashed? Yeah, right! You told me it crashed. Forgot about that. Hmmmmm... Have you tried rebooting? That usually works."

    "It works now? Great. That'll be $3500 payable to Microsoft Inc."

    "You want to know why NT crashed? I'm sorry, we cannot do that. That is proprietary information. We seem to have solved your problem, however. Your computer seems to be working fine. We expect payment within two weeks. Have a nice day."

    --

    "Evil will always triumph over good, because good is dumb." - Dark Helmet (Spaceballs)

  142. sun suing by mattdm · · Score: 1
    IRIX machines have been doing this for quite a while....

    --

  143. 'bad programs' by mattdm · · Score: 2
    It's not very likely to be a problem with a userland program, but rather something with the kernel itself -- maybe a third-party kernel module, or something you're hacking.

    --

  144. Doesn't BSD have this already? by seebs · · Score: 2

    This sounds *EXACTLY* like the way BSD kernels have, since the dawn of time, handled panics. If you have enough swap space, the kernel dumps a complete core image (in a special format) to the swap device. Then, on boot, it extracts it before enabling swap, and copies a kernel over. (Goes in /var/crash, if such a place exists.)

    I've used this to debug (or have someone else debug) kernel panics on BSD/OS and NetBSD systems. It's a *very* nice feature, because, in the real world, you often have a crash that can't be encouraged to happen right when the engineer is handy.

    Common feature, been available for years. I just *assumed* Linux had it.

    --
    My blog: http://www.seebs.net/log/ --- My iPhone/iPad app: http://www.seebs.net/seebsfrac/
  145. Re:FIRST POST!!! by gregstoll · · Score: 1

    Wow - Karma = -30?? Not exactly adding intelligent insight to stories... On a related note, would Rob or whoever be willing to block his IP address from commenting or something? I mean, with 22 comments and nothing but "First Post"...would people be willing to support this? Just curious...

  146. Just like NT 4.0 by Lt_Kernal · · Score: 1

    Windows NT 4.0 does a very similar thing when the appropriate options are checked in the Startup and Recovery options under the System object in Control Panel. Problem is...it's the ENTIRE memory space. If you have 128 MB of RAM, you better have 128 MB of swapfile space on the system partition. Not a smart thing to do when the boot partition (where the \WINNT directory is) resides in the same partition...the hard disk has to constantly shuffle btw the swap file and the \WINNT directory. If you place the swap file on a different partition (i.e. optimize for I/O speed), the crash dump file (memory.dmp) is not created when NT bluescreens. This particular thing that SGI's doing is a MUCH smarter way of going about it. Though one of the coolest things about Win2K is the fact that you can choose btw a full mem dump, a kernel mem dump, and a 64K minidump. That's a Good Thing for those of us who like to optimize our swap file and move it to a different partition or split it up a bit.
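
    The constraint described above (a full memory dump needs swap-file space at least as large as RAM, on the same partition as \WINNT) can be written down as a small check. This is a sketch of the rule as stated in this comment only, not of any real Windows API; the function names and parameters are hypothetical, and the only dump size used is the 64K minidump figure quoted above.

```python
MINIDUMP_KB = 64  # the Win2K minidump size mentioned above


def full_dump_possible(ram_mb: int, swapfile_mb: int, swapfile_with_winnt: bool) -> bool:
    """True if, per the rule above, a full dump of all RAM has somewhere to go."""
    return swapfile_with_winnt and swapfile_mb >= ram_mb


def pick_dump_type(ram_mb: int, swapfile_mb: int, swapfile_with_winnt: bool) -> str:
    """Fall back to the small minidump when a full dump would not fit."""
    if full_dump_possible(ram_mb, swapfile_mb, swapfile_with_winnt):
        return "full"
    return f"minidump ({MINIDUMP_KB}K)"


print(pick_dump_type(128, 128, True))    # swap file large enough, next to \WINNT
print(pick_dump_type(128, 256, False))   # swap file moved to another partition
```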

    However...sifting through that crap with the dumpchk.exe and dumpexam.exe utilities is akin to getting your teeth pulled...:)

    Another nifty thing NT has is the ability to t-shoot a box by hooking up another NT box to it thru the serial port (or remotely, with a modem) and, by using the symbol files, find out EXACTLY where in the OS code a particular process is failing, because when NT bluescreens, it's not really crashed...the kernel is still spinning happily away churning out that dump file. That ain't too bad, but it's a bitch to set up.

    I prefer just to decipher the bluescreen and find out which piece of shit hardware (or driver) is causing the failure...:)

    -Kevin Bunn, MCSE/MCT - MCP ID # 1198191

    PS: Yes, you heard it correctly. The way NT does it, the BOOT partition is where the system files (i.e. \WINNT) are and the SYSTEM partition is where the boot files (i.e. boot.ini, ntldr, and ntdetect.com) are. Another weird MS-ism...:P

    --
    My posts don't reflect the opinion of my employer, and my employer's opinion doesn't influence the content of my posts.