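0. Common methods for analyzing Linux kernel exceptions
Check whether the faulting address is near 0; if so, suspect a NULL pointer dereference.
Check whether the faulting address falls in an iomem-mapped region; if so, suspect a device or bus access fault, for example an address access error caused by a PCI fault.
Check whether the faulting address is near a stack; if it is adjacent, consider whether the memory was overwritten.
Compare the stack traces printed by different mechanisms (delay reset, NMI watchdog, and so on) and see whether the PC is moving, to determine whether the system is deadlocked.
Use SysRq to tell a real hang from an apparent one.
Disassemble to find the C code and function where the exception occurred, then check whether the open-source community already has a patch that fixes it.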
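The first three address checks above can be sketched as a small triage helper. This is only an illustration under stated assumptions: the function name `classify_fault`, the 4 KiB "near 0" window, the 8 KiB stack size, and the iomem range passed in the usage lines are all hypothetical and must be replaced with the values of the actual platform (for example from /proc/iomem and the kernel configuration). The sample addresses are taken from the PowerPC dump in section 1.1.

```python
NULL_WINDOW = 0x1000   # assumption: treat the first 4 KiB as "near 0"
THREAD_SIZE = 0x2000   # assumption: 8 KiB kernel stack per task


def classify_fault(addr, stack_lo, iomem_ranges=()):
    """Give a first guess at the fault category from the faulting address.

    addr         -- faulting data address reported by the kernel
    stack_lo     -- lowest address of the current task's kernel stack
    iomem_ranges -- iterable of (start, end) pairs, end exclusive
    """
    if addr < NULL_WINDOW:
        return "null-deref"
    for start, end in iomem_ranges:
        if start <= addr < end:
            return "iomem"
    # Inside the stack, or within one stack size on either side of it.
    if stack_lo - THREAD_SIZE <= addr < stack_lo + 2 * THREAD_SIZE:
        return "near-stack"
    return "other"


if __name__ == "__main__":
    stack_lo = 0xCE282000                  # THREAD value from the dump
    io = [(0xE0000000, 0xE0100000)]        # hypothetical iomem window
    for a in (0x8, 0xE0000010, 0xCE283ED0, 0x36FEF31E):
        print(hex(a), classify_fault(a, stack_lo, io))
```

A result of "other" only means that none of the three quick checks applies; the address still has to be traced back through the registers and the disassembly.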
The following sections walk through the analysis in detail using two exception examples, one on PowerPC and one on MIPS64.
1. Kernel exception analysis on a small PowerPC system
1.1 Exception output
Unable to handle kernel paging request for data at address 0x36fef31e
Faulting instruction address: 0xc0088b8c
Oops: Kernel access of bad area, sig: 11 [#1]
PREEMPT SMP NR_CPUS=2
Modules linked in: ossmod tipc ohci_hcd ehci_hcd cmm uart1655x bcm334 bootflash mtdchar bsp_flash_init boardctrl 85xx_debug util
NIP: C0088B8C LR: C0088CF8 CTR: 00000000
REGS: ce283e20 TRAP: 0300 Not tainted (2.6.21.7-EMBSYS-CGEL-3.04.10.P6.F5)
MSR: 00021000 CR: 22004222 XER: 00000000
DAR: 36FEF31E, DSISR: 00800000
TASK = cffdf180[26] 'events/1' THREAD: ce282000 CPU: 1
GPR00: 00100100 CE283ED0 CFFDF180 CF528000 C09EA500 EFFEAD20 CF5188A0 00000000
GPR08: CF5188BC 00200200 36FEF31E D1FD7F9E 22004222 1010DA44 00000290 00000000
GPR16: 1011C858 100147F4 BF9BC9C4 10100000 00000001 C0460000 C06454CC 00000000
GPR24: C0640000 CE282000 C0640000 00000005 00000000 00000000 EFFE8EC0 CFFED958
NIP [C0088B8C] free_block+0xc4/0x16c
LR [C0088CF8] drain_array+0xc4/0x100
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Call Trace:
[CE283ED0] [C06ABEC0] 0xc06abec0(unreliable)
[CE283EF0] [C0088CF8] drain_array+0xc4/0x100
[CE283F10] [C008A70C] cache_reap+0x94/0x13c
[CE283F30] [C003DA2C] run_workqueue+0xc4/0x198
[CE283F60] [C003E6D4] worker_thread+0x130/0x154
[CE283FB0] [C0042E80] kthread+0xd4/0x110
[CE283FF0] [C0011A70] original_kernel_thread+0x44/0x60
Instruction dump:
5400cffe 0f000000 80c4001c 7d1cf214 3c000010 3d200020 80a8001c 60000100
81660000 61290200 81460004 3906001c <916a0000> 914b0004 90060000 91260004
------------[ cut here ]------------
Badness at c0011e4c [verbose debug info unavailable]
Call Trace:
[CE283C50] [C00080BC] show_stack+0x3c/0x1a0 (unreliable)
[CE283C80] [C018EA28] report_bug+0xb0/0xb8
[CE283C90] [C000EC94] program_check_exception+0xcc/0x4f8
[CE283CD0] [C0010BE4] ret_from_except_full+0x0/0x4c
[CE283D90] [C0640000] 0xc0640000
[CE283DD0] [C000E61C] die+0x1f0/0x27c
[CE283E00] [C0014B18] bad_page_fault+0x98/0xe8
[CE283E10] [C0010A88] handle_page_fault+0x7c/0x80
[CE283ED0] [C06ABEC0] 0xc06abec0
[CE283EF0] [C0088CF8] drain_array+0xc4/0x100
[CE283F10] [C008A70C] cache_reap+0x94/0x13c
[CE283F30] [C003DA2C] run_workqueue+0xc4/0x198
[CE283F60] [C003E6D4] worker_thread+0x130/0x154
[CE283FB0] [C0042E80] kthread+0xd4/0x110
[CE283FF0] [C0011A70] original_kernel_thread+0x44/0x60
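The key fields of a dump like the one above can be extracted mechanically before any manual analysis. The sketch below is a minimal illustration, not a general oops parser: the helper names `parse_oops` and `decode_dform` are hypothetical, only the lines needed for the example are embedded, and the decoder handles only two PowerPC D-form opcodes (32 = lwz, 36 = stw). It pulls out the fault address, the faulting PC, the GPRs, and the instruction word marked with `<...>` in the instruction dump, then decodes that word.

```python
import re

# Excerpt of the dump above; only the lines this sketch needs.
OOPS = """\
Unable to handle kernel paging request for data at address 0x36fef31e
Faulting instruction address: 0xc0088b8c
GPR08: CF5188BC 00200200 36FEF31E D1FD7F9E 22004222 1010DA44 00000290 00000000
Instruction dump:
5400cffe 0f000000 80c4001c 7d1cf214 3c000010 3d200020 80a8001c 60000100
81660000 61290200 81460004 3906001c <916a0000> 914b0004 90060000 91260004
"""


def parse_oops(text):
    """Return (fault address, faulting PC, marked instruction word, GPR dict)."""
    fault = int(re.search(r"at address (0x[0-9a-fA-F]+)", text).group(1), 16)
    pc = int(re.search(r"Faulting instruction address: (0x[0-9a-fA-F]+)",
                       text).group(1), 16)
    insn = int(re.search(r"<([0-9a-fA-F]{8})>", text).group(1), 16)
    gprs = {}
    for m in re.finditer(r"GPR(\d\d): ((?:[0-9A-Fa-f]{8} ?)+)", text):
        first = int(m.group(1))            # number of the first register on the line
        for i, word in enumerate(m.group(2).split()):
            gprs[first + i] = int(word, 16)
    return fault, pc, insn, gprs


def decode_dform(insn):
    """Decode a 32-bit PowerPC D-form word into (mnemonic, rs, ra, d)."""
    opcode = insn >> 26                    # top 6 bits
    rs = (insn >> 21) & 0x1F               # source/target register
    ra = (insn >> 16) & 0x1F               # base register
    d = insn & 0xFFFF                      # signed 16-bit displacement
    if d & 0x8000:
        d -= 0x10000
    names = {32: "lwz", 36: "stw"}         # only these two are handled here
    return names.get(opcode), rs, ra, d


if __name__ == "__main__":
    fault, pc, insn, gprs = parse_oops(OOPS)
    name, rs, ra, d = decode_dform(insn)
    base = 0 if ra == 0 else gprs[ra]      # in D-form, RA = 0 means base 0
    print(f"pc={pc:#x} {name} r{rs},{d}(r{ra}) ea={base + d:#x} fault={fault:#x}")
```

For this dump the marked word 916a0000 decodes to stw r11,0(r10), and r10 holds 0x36FEF31E, which matches the reported fault address (DAR). The bad value therefore came from r10 rather than from the displacement.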
1.2 Oops analysis
Oops: Kernel access of bad area, sig: 11 [#1]
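Exception categories
Oops: an exception raised by an instruction executing in kernel mode;
BUG: the kernel has detected a logic error (similar to an assert) that affects subsequent kernel operation;
WARNING: similar to BUG, but does not affect subsequent kernel operation;
PANIC: similar to BUG, but the system cannot continue running and hangs or reboots immediately;
SOFTLOCK (soft lockup): tasks go unscheduled for a long time.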
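When scanning a long console log, the categories above can be spotted by their typical message prefixes. The sketch below is a rough illustration: the pattern strings are assumptions based on commonly seen kernel messages ("Kernel panic", "soft lockup", "kernel BUG at", "WARNING:", and the "Badness at" line in the dump above), and the exact wording varies between kernel versions and architectures. The helper names `classify_line` and `parse_oops_header` are hypothetical.

```python
import re

# Order matters: more specific patterns come first.
# The message texts are assumptions and differ between kernel versions.
PATTERNS = [
    ("PANIC", re.compile(r"Kernel panic")),
    ("SOFTLOCK", re.compile(r"soft lockup")),
    ("Oops", re.compile(r"\bOops\b")),
    ("BUG", re.compile(r"kernel BUG at")),
    ("WARNING", re.compile(r"WARNING:|Badness at")),
]


def classify_line(line):
    """Return the exception category a log line belongs to, or None."""
    for name, pattern in PATTERNS:
        if pattern.search(line):
            return name
    return None


def parse_oops_header(line):
    """Split an 'Oops: <text>, sig: <n> [#<count>]' line into its three fields."""
    m = re.search(r"Oops: (.*), sig: (\d+) \[#(\d+)\]", line)
    if m is None:
        return None
    return m.group(1), int(m.group(2)), int(m.group(3))


if __name__ == "__main__":
    header = "Oops: Kernel access of bad area, sig: 11 [#1]"
    print(classify_line(header), parse_oops_header(header))
    print(classify_line("Badness at c0011e4c [verbose debug info unavailable]"))
```

In the header line, the number after "sig:" is the signal associated with the fault and the number in "[#1]" counts the Oopses since boot, so the first one is usually the most trustworthy.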
Exception signals
Signal   Code  Default Action                           Description
SIGABRT  6     A                                        Process abort signal
SIGALRM  14    T                                        Alarm clock
SIGBUS   10    A                                        Access to an undefined portion of a memory object
SIGCHLD  18    I - Ignore the Signal                    Child process terminated, stopped,
SIGCONT  25    C - Continue the process                 Continue executing, if stopped.
SIGFPE   8     A                                        Erroneous arithmetic operation.
SIGHUP   1     T                                        Hangup.
SIGILL   4     A                                        Illegal instruction.
SIGINT   2     T                                        Terminal interrupt signal.
SIGKILL  9     T                                        Kill (cannot be caught or ignored).
SIGPIPE  13    T - Abnormal termination of the process  Write on a pipe with no one to read it.
SIGQUIT  3     A - Abnormal termination of the process  Termina