OpenSolaris kernel diagosing and debugging (持续更新中)

Diagnosing kernel hangs/panics with kmdb and moddebug

If you experience hangs or panics during Solaris boot, whether it's during installation or after you've already installed, using the kernel debugger can be a big help in collecting the first set of "what happened" information.

The kernel debugger is named "kmdb" in Solaris 10 and later, and is invoked by supplying the '-k' switch in the kernel boot arguments. So a common request from a kernel engineer starting to examine a problem is often "try booting with kmdb".

Sometimes it's useful to either set a breakpoint to pause the kernel startup and examine something, or to just set a kernel variable to enable or disable a feature, or enable debugging output. If you use -k to invoke kmdb, but also supply the '-d' switch, the debugger will be entered before the kernel really starts to do anything of consequence, so that you can set kernel variables or breakpoints.

So "booting with the -kd flags" is the key to "booting under the kernel debugger". Now, how do we do that?

Kernel debugger in Solaris 10

[This might not work for OpenSolaris, but I still keep it here for the Solaris 10 case]


To enter the debugger with Solaris 10, enter "b -kd" to the appropriate prompt; this is slightly different whether you're installing or booting an already-installed system:

Install:

Select the type of installation you want to perform:

1 Solaris Interactive
2 Custom JumpStart
3 Solaris Interactive Text (Desktop session)
4 Solaris Interactive Text (Console session)

Enter the number of your choice followed by the <ENTER> key.
Alternatively, enter custom boot arguments directly.

If you wait for 30 seconds without typing anything,
an interactive installation will be started.

Select type of installation:

Installed system:

Type    b [file-name] [boot-flags] <ENTER>      to boot with options
or i <ENTER> to enter boot interpreter
or <ENTER> to boot with defaults'

<<< timeout in 5 seconds >>>"

Select (b)oot or (i)nterpreter:

Either way, you'll drop into the kernel debugger in short order, which will announce itself with this prompt:

[0]>

(The number in square brackets is the CPU that is running the kernel debugger; that number might change for later entries into the debugger.)

Kernel debugging with GRUB-boot systems

If instead, you're doing this with Software Express build later than 05/05, where GRUB is used to boot Solaris, you add the -kd to the "kernel" line in the GRUB menu entry (you can edit GRUB menu entries for this boot by using the GRUB menu interface, and the 'e' (for edit) key).

Now we're in the kernel debugger
There are two good reasons to run under the kernel debugger:
  1. If we panic, the panic can be examined before reboot; you can get stack backtraces and get some idea of which section of code might be at fault.
  2. Now we can set kernel variables, set breakpoints, etc. to affect the kernel run.
Obviously, there's a lot you can do in a kernel debugger, and I'm only touching on it here, but here are two good ones:
  1. For investigating hangs: try turning on module debugging output. You can set the value of a kernel variable by using the '/W' command ("write a 32-bit value"). Here's how you set moddebug to 0x80000000, and then continue execution of the kernel:
    [0]> moddebug/W 80000000
    [0]> :c
    That will give you debug output for each kernel module that loads. (see /usr/include/sys/modctl.h, near the bottom, for moddebug flag information. I find 0x80000000 is the only one I really ever use.)
  2. To collect information about panics: when the kernel panics, it will drop into the debugger, and print some interesting information; however, usually the most interesting thing, first, is the stack backtrace; this shows, in reverse order, all the functions that were active at the time of panic. To generate a stack backtrace, use
    [0]> $c

    A few other very useful information commands during a panic are

    ::msgbuf
    which will show you the last things the kernel printed onscreen, and
    ::status
    which shows a summary of the state of the machine in panic.
  3. If you're running the kernel while the kernel debugger is active, and you experience a hang, you may be able to break into the debugger to examine the system state; you can do this by pressing the <F1> and <A> keys at the same time (a sort of "F1-shifted-A" keypress). (On SPARC systems, this key sequence is <Stop>-<A>.) This should give you the same debugger prompt as above, although on a multi-CPU system you may see the CPU number in the prompt is something other than 0. Once in the kernel debugger, you can get a stack backtrace as above; you can also use ::switch to change the CPU and get stack backtraces on the different CPU, which might shed more light on the hang. For instance, if you break into the debugger on CPU 1, you could switch to CPU 0 with
    [1]> 0::switch

To disassemble the code where panic
When panic happens and kmdb is entered. You always get some message like:

panic[cpu0]/thread=
fffffffffbc736e0: BAD TRAP: type=e (#pf Page fault)
rp=fffffffffbca6090 addr=d occurred in module "unix" due to a NULL
pointer dereference

#pf Page fault
Bad kernel fault at addr=0xd
pid=0, pc=0xfffffffffb846d29, sp=0xfffffffffbca6180, eflags=0x10246
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 2620<vmxe,xmme,fxsr,pae>
cr2: d
       rdi:              286 rsi:                0 rdx:         fffffffe
       rcx:                1  r8:                0  r9:            40000
       rax:                d rbx:                0 rbp: fffffffffbca61c0
       r10: fffffffffbc74ab0 r11: ffffff012fe59000 r12:                0
       r13: fffffffffbcb6dc0 r14:                1 r15: ffffff0135a8d580
       fsb:        200000000 gsb: fffffffffbc74ab0  ds:                0
        es:                0  fs:                0  gs:                0
       trp:                e err:                2 rip: fffffffffb846d29
        cs:             e030 rfl:            10246 rsp: fffffffffbca6180
        ss:             e02b

cpu          address    timestamp type  vc  handler   pc
 0 fffffffffbc20f18   6afbf91146 trap   e      #pf ec_bind_virq_to_irq+99
 0 fffffffffbc20d90   6afb5cd249 intr 104 cbe_fire inflate_fast+112
 0 fffffffffbc20c08   6afb155e6b intr  13 uhci_intr HYPERVISOR_sched_op+29
 0 fffffffffbc20a80   6afb02521b intr  13 uhci_intr HYPERVISOR_sched_op+29
 0 fffffffffbc208f8   6afaeef1e9 intr  13 uhci_intr HYPERVISOR_sched_op+29
 0 fffffffffbc20770   6afad1750e intr  13 uhci_intr HYPERVISOR_sched_op+29
 0 fffffffffbc205e8   6afab239db intr  13 uhci_intr HYPERVISOR_sched_op+29
 0 fffffffffbc20460   6afa8da042 intr  13 uhci_intr HYPERVISOR_sched_op+29
 0 fffffffffbc202d8   6afa725ad3 intr  13 uhci_intr copy_pattern+1f
 0 fffffffffbc20150   6afa618a15 intr  13 uhci_intr HYPERVISOR_sched_op+29

fffffffffbca5f50 unix:die+d2 ()
fffffffffbca6080 unix:trap+162f ()
fffffffffbca6090 unix:cmntrap+24d ()
fffffffffbca61c0 unix:ec_bind_virq_to_irq+99 ()
fffffffffbca61f0 xpv_psm:xen_psm_cpu_start+4b ()
fffffffffbca6210 unix:mach_cpu_start+4a ()
fffffffffbca6270 unix:start_cpu+5e ()
fffffffffbca62b0 unix:start_other_cpus+db ()
fffffffffbca62f0 genunix:main+2bf ()
fffffffffbca6300 unix:_locore_start+80 ()

panic: entering debugger (no dump device, continue to reboot)

Loaded modules: [ scsi_vhci neti xpv_psm zfs uhci hook ip usba specfs sctp arp
xpv_uppc ]
kmdb: target stopped at:
kmdb_enter+0xb: movq   %rax,%rdi
[0]>

The stack backtrace displaying here is giving you the messages where your panic happens. You might know exactly where the panic point is. A fragement of assemble may give you more information:

[0]> ec_bind_virq_to_irq+0x90::dis
ec_bind_virq_to_irq+0x90:       sti
ec_bind_virq_to_irq+0x91:       movq   %r13,%rdi
ec_bind_virq_to_irq+0x94:       call   +0x16d37 <mutex_exit>
ec_bind_virq_to_irq+0x99:       addb   %al,(%rax)
<< assemble after corrupt
ec_bind_virq_to_irq+0x9b:       addb   %al,(%rax)
ec_bind_virq_to_irq+0x9d:       addb   %al,(%rax)
ec_bind_virq_to_irq+0x9f:       addb   %al,(%rax)
ec_bind_virq_to_irq+0xa1:       sti
ec_bind_virq_to_irq+0xa2:       popq   %r14
ec_bind_virq_to_irq+0xa4:       popq   %r13
ec_bind_virq_to_irq+0xa6:       popq   %r12
[0]> ::dis
ec_bind_virq_to_irq+0xa8:       popq   %rbx
ec_bind_virq_to_irq+0xa9:       leave
ec_bind_virq_to_irq+0xaa:       ret
ec_bind_virq_to_irq+0xab:       leaq   +0x110296(%rip),%rdi
<0xfffffffffb956fd8>

You can get the exact assemble at "ec_bind_virq_to_irq+99". And you should where this is related to the C code if you have some assemble background.

To set a write/read break point
If you'd like the kernel stop when some memory space is write/read, the read/write break point supported in kmdb might be very helpful to you.

[0]> ec_bind_virq_to_irq+0xab::wp -w -L 8

Above command set a write point when memory space start from "ec_bind_virq_to_irq+0xab" with length 8.

To get the module information
Some modules depend on other modules to be loaded. So watch the modules kernel has loaded during boot is neccessary in some cases.

[todo]

Get dev_info related information
Dev_info is very important when I am debugging a device driver alike module. Some dev_info related cmd/walk is helpful sometimes.

[todo]

Get memory/interrupt information
[allen@tecra:~]echo "::kmastat ! grep Total"| sudo mdb -k
Total [hat_memload]                              5541888B   4157558     0
Total [kmem_msb]                                 4657152B    127397     0
Total [kmem_va]                                253624320B     46322     0
Total [kmem_default]                           251473920B 1316330951     0
Total [kmem_io_2G]                              34635776B      8480     0
Total [bp_map]                                    131072B       403     0
Total [umem_np]                                   786432B       724     0
Total [segkp]                                     458752B      8756     0
Total [ip_minor_arena_sa]                             64B      1484     0
Total [ip_minor_arena_la]                            128B    917367     0
Total [dld_ctl]                                       64B        31     0
Total [spdsock]                                       64B         1     0
Total [namefs_inodes]                                 64B        26     0
[allen@tecra:~]echo "::interrupts" | sudo mdb -k
----skip the output----


[allen@tecra:~]echo "::memstat"| sudo mdb -k
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     103031               402   20%
Anon                       110261               430   21%
Exec and libs               17929                70    3%
Page cache                  51870               202   10%
Free (cachelist)           217239               848   42%
Free (freelist)             19556                76    4%

Total                      519886              2030
Physical                   519885              2030

Debugging the xVM hypervisor
If you want to see hypervisor output over a  serial line, edit the kernel$ line:
   title Solaris xVM
   kernel$ /boot/$ISADIR/xen.gz console=com1 com1=9600,8n1
   module$ /platform/i86xpv/kernel/$ISADIR/unix /platform/i86xpv/kernel/$ISADIR/unix -B console=hypervisor
   module$ /platform/i86pc/$ISADIR/boot_archive

Disable a damaged driver in grub kernel line
If you discover the name of the driver you think is doing the damage, add
     -B disable-<drivername>=true
to your Grub kernel line; that will prevent the driver from loading.

There's obviously a lot more you can do with the kernel debugger, but these small tips will sometimes help get from a "I have no idea what to do" to "I have a few ideas to try that might let me continue to boot or install", which can make all the difference.
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值