If you experience hangs or panics during Solaris boot, whether it's during installation or after you've already installed, using the kernel debugger can be a big help in collecting the first set of "what happened" information.
The kernel debugger is named "kmdb" in Solaris 10 and later, and is invoked by supplying the '-k' switch in the kernel boot arguments. So a common request from a kernel engineer starting to examine a problem is often "try booting with kmdb".
Sometimes it's useful to either set a breakpoint to pause the kernel startup and examine something, or to just set a kernel variable to enable or disable a feature, or enable debugging output. If you use -k to invoke kmdb, but also supply the '-d' switch, the debugger will be entered before the kernel really starts to do anything of consequence, so that you can set kernel variables or breakpoints.
So "booting with the -kd flags" is the key to "booting under the kernel debugger". Now, how do we do that?
Kernel debugger in Solaris 10
[This might not work for OpenSolaris, but I still keep it here for the Solaris 10 case]
To enter the debugger with Solaris 10, enter "b -kd" to the appropriate prompt; this is slightly different whether you're installing or booting an already-installed system:
Install:
Select the type of installation you want to perform:
1 Solaris Interactive
2 Custom JumpStart
3 Solaris Interactive Text (Desktop session)
4 Solaris Interactive Text (Console session)
Enter the number of your choice followed by the <ENTER> key.
Alternatively, enter custom boot arguments directly.
If you wait for 30 seconds without typing anything,
an interactive installation will be started.
Select type of installation:
Installed system:
Type b [file-name] [boot-flags] <ENTER> to boot with options
or i <ENTER> to enter boot interpreter
or <ENTER> to boot with defaults'
<<< timeout in 5 seconds >>>"
Select (b)oot or (i)nterpreter:
Either way, you'll drop into the kernel debugger in short order, which will announce itself with this prompt:
[0]>
(The number in square brackets is the CPU that is running the kernel debugger; that number might change for later entries into the debugger.)
Kernel debugging with GRUB-boot systems
If instead, you're doing this with Software Express build later than 05/05, where GRUB is used to boot Solaris, you add the -kd to the "kernel" line in the GRUB menu entry (you can edit GRUB menu entries for this boot by using the GRUB menu interface, and the 'e' (for edit) key).
Now we're in the kernel debugger
There are two good reasons to run under the kernel debugger:- If we panic, the panic can be examined before reboot; you can get stack backtraces and get some idea of which section of code might be at fault.
- Now we can set kernel variables, set breakpoints, etc. to affect the kernel run.
- For investigating hangs: try turning on module debugging output. You can set the value of a kernel variable by using the '/W' command ("write a 32-bit value"). Here's how you set moddebug to 0x80000000, and then continue execution of the kernel:
[0]> moddebug/W 80000000
That will give you debug output for each kernel module that loads. (see /usr/include/sys/modctl.h, near the bottom, for moddebug flag information. I find 0x80000000 is the only one I really ever use.)
[0]> :c - To collect information about panics: when the kernel panics, it will drop into the debugger, and print some interesting information; however, usually the most interesting thing, first, is the stack backtrace; this shows, in reverse order, all the functions that were active at the time of panic. To generate a stack backtrace, use
[0]> $c
A few other very useful information commands during a panic are
::msgbuf
which will show you the last things the kernel printed onscreen, and::status
which shows a summary of the state of the machine in panic. - If you're running the kernel while the kernel debugger is active, and you experience a hang, you may be able to break into the debugger to examine the system state; you can do this by pressing the <F1> and <A> keys at the same time (a sort of "F1-shifted-A" keypress). (On SPARC systems, this key sequence is <Stop>-<A>.) This should give you the same debugger prompt as above, although on a multi-CPU system you may see the CPU number in the prompt is something other than 0. Once in the kernel debugger, you can get a stack backtrace as above; you can also use ::switch to change the CPU and get stack backtraces on the different CPU, which might shed more light on the hang. For instance, if you break into the debugger on CPU 1, you could switch to CPU 0 with
[1]> 0::switch
To disassemble the code where panic
When panic happens and kmdb is entered. You always get some message like:panic[cpu0]/thread=
rp=fffffffffbca6090 addr=d occurred in module "unix" due to a NULL
pointer dereference
#pf Page fault
Bad kernel fault at addr=0xd
pid=0, pc=0xfffffffffb846d29, sp=0xfffffffffbca6180, eflags=0x10246
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 2620<vmxe,xmme,fxsr,pae>
cr2: d
rdi: 286 rsi: 0 rdx: fffffffe
rcx: 1 r8: 0 r9: 40000
rax: d rbx: 0 rbp: fffffffffbca61c0
r10: fffffffffbc74ab0 r11: ffffff012fe59000 r12: 0
r13: fffffffffbcb6dc0 r14: 1 r15: ffffff0135a8d580
fsb: 200000000 gsb: fffffffffbc74ab0 ds: 0
es: 0 fs: 0 gs: 0
trp: e err: 2 rip: fffffffffb846d29
cs: e030 rfl: 10246 rsp: fffffffffbca6180
ss: e02b
cpu address timestamp type vc handler pc
0 fffffffffbc20f18 6afbf91146 trap e #pf ec_bind_virq_to_irq+99
0 fffffffffbc20d90 6afb5cd249 intr 104 cbe_fire inflate_fast+112
0 fffffffffbc20c08 6afb155e6b intr 13 uhci_intr HYPERVISOR_sched_op+29
0 fffffffffbc20a80 6afb02521b intr 13 uhci_intr HYPERVISOR_sched_op+29
0 fffffffffbc208f8 6afaeef1e9 intr 13 uhci_intr HYPERVISOR_sched_op+29
0 fffffffffbc20770 6afad1750e intr 13 uhci_intr HYPERVISOR_sched_op+29
0 fffffffffbc205e8 6afab239db intr 13 uhci_intr HYPERVISOR_sched_op+29
0 fffffffffbc20460 6afa8da042 intr 13 uhci_intr HYPERVISOR_sched_op+29
0 fffffffffbc202d8 6afa725ad3 intr 13 uhci_intr copy_pattern+1f
0 fffffffffbc20150 6afa618a15 intr 13 uhci_intr HYPERVISOR_sched_op+29
fffffffffbca5f50 unix:die+d2 ()
fffffffffbca6080 unix:trap+162f ()
fffffffffbca6090 unix:cmntrap+24d ()
fffffffffbca61c0 unix:ec_bind_virq_to_irq+99 ()
fffffffffbca61f0 xpv_psm:xen_psm_cpu_start+4b ()
fffffffffbca6210 unix:mach_cpu_start+4a ()
fffffffffbca6270 unix:start_cpu+5e ()
fffffffffbca62b0 unix:start_other_cpus+db ()
fffffffffbca62f0 genunix:main+2bf ()
fffffffffbca6300 unix:_locore_start+80 ()
panic: entering debugger (no dump device, continue to reboot)
Loaded modules: [ scsi_vhci neti xpv_psm zfs uhci hook ip usba specfs sctp arp
xpv_uppc ]
kmdb: target stopped at:
kmdb_enter+0xb: movq %rax,%rdi
[0]>
The stack backtrace displaying here is giving you the messages where your panic happens. You might know exactly where the panic point is. A fragement of assemble may give you more information:
[0]> ec_bind_virq_to_irq+0x90::dis
ec_bind_virq_to_irq+0x90: sti
ec_bind_virq_to_irq+0x91: movq %r13,%rdi
ec_bind_virq_to_irq+0x94: call +0x16d37 <mutex_exit>
ec_bind_virq_to_irq+0x99: addb %al,(%rax)
<< assemble after corrupt
ec_bind_virq_to_irq+0x9b: addb %al,(%rax)
ec_bind_virq_to_irq+0x9d: addb %al,(%rax)
ec_bind_virq_to_irq+0x9f: addb %al,(%rax)
ec_bind_virq_to_irq+0xa1: sti
ec_bind_virq_to_irq+0xa2: popq %r14
ec_bind_virq_to_irq+0xa4: popq %r13
ec_bind_virq_to_irq+0xa6: popq %r12
[0]> ::dis
ec_bind_virq_to_irq+0xa8: popq %rbx
ec_bind_virq_to_irq+0xa9: leave
ec_bind_virq_to_irq+0xaa: ret
ec_bind_virq_to_irq+0xab: leaq +0x110296(%rip),%rdi
<0xfffffffffb956fd8>
You can get the exact assemble at "ec_bind_virq_to_irq+99". And you should where this is related to the C code if you have some assemble background.
To set a write/read break point
If you'd like the kernel stop when some memory space is write/read, the read/write break point supported in kmdb might be very helpful to you.
[0]> ec_bind_virq_to_irq+0xab::wp -w -L 8
Above command set a write point when memory space start from "ec_bind_virq_to_irq+0xab" with length 8.
To get the module information
Some modules depend on other modules to be loaded. So watch the modules kernel has loaded during boot is neccessary in some cases.
[todo]
Get dev_info related information
Dev_info is very important when I am debugging a device driver alike module. Some dev_info related cmd/walk is helpful sometimes.
[todo]
Get memory/interrupt information
[allen@tecra:~]echo "::kmastat ! grep Total"| sudo mdb -k
Total [hat_memload] 5541888B 4157558 0
Total [kmem_msb] 4657152B 127397 0
Total [kmem_va] 253624320B 46322 0
Total [kmem_default] 251473920B 1316330951 0
Total [kmem_io_2G] 34635776B 8480 0
Total [bp_map] 131072B 403 0
Total [umem_np] 786432B 724 0
Total [segkp] 458752B 8756 0
Total [ip_minor_arena_sa] 64B 1484 0
Total [ip_minor_arena_la] 128B 917367 0
Total [dld_ctl] 64B 31 0
Total [spdsock] 64B 1 0
Total [namefs_inodes] 64B 26 0
[allen@tecra:~]echo "::interrupts" | sudo mdb -k
----skip the output----
[allen@tecra:~]echo "::memstat"| sudo mdb -k
Page Summary Pages MB %Tot
------------ ---------------- ---------------- ----
Kernel 103031 402 20%
Anon 110261 430 21%
Exec and libs 17929 70 3%
Page cache 51870 202 10%
Free (cachelist) 217239 848 42%
Free (freelist) 19556 76 4%
Total 519886 2030
Physical 519885 2030
Debugging the xVM hypervisor
If you want to see hypervisor output over a serial line, edit the kernel$ line:title Solaris xVM
kernel$ /boot/$ISADIR/xen.gz console=com1 com1=9600,8n1
module$ /platform/i86xpv/kernel/$ISADIR/unix /platform/i86xpv/kernel/$ISADIR/unix -B console=hypervisor
module$ /platform/i86pc/$ISADIR/boot_archive
Disable a damaged driver in grub kernel line
If you discover the name of the driver you think is doing the damage, add
-B disable-<drivername>=true
to your Grub kernel line; that will prevent the driver from loading.
There's obviously a lot more you can do with the kernel debugger, but these small tips will sometimes help get from a "I have no idea what to do" to "I have a few ideas to try that might let me continue to boot or install", which can make all the difference.