I. Heap and stack
Before analyzing a segmentation fault, it helps to review what the heap and the stack are.
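Heap: memory that the developer allocates and releases explicitly. If it is never released, some operating systems reclaim it when the program exits. Allocators commonly track free blocks in a structure resembling a linked list. Unlike a queue, the heap imposes no first-in, first-out order: blocks can be allocated and released in any order.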
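Stack: allocated and released automatically; it holds function arguments and local variables. The stack works on a last-in, first-out basis.
The stack defines a few operations. The two most important are PUSH and POP.
PUSH: adds one element on top of the stack.
POP: the opposite; removes the element on top of the stack and reduces the stack size by one.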
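To make PUSH and POP concrete, here is a minimal sketch of a last-in, first-out stack in C. The type int_stack, the functions stack_push and stack_pop, and the capacity of 16 are names and values made up for this illustration; they are not part of the kernel or of the driver discussed below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define STACK_CAPACITY 16   /* arbitrary size chosen for this sketch */

struct int_stack {
    int items[STACK_CAPACITY];
    size_t size;            /* number of elements currently stored */
};

/* PUSH: add one element on top; returns false if the stack is full. */
static bool stack_push(struct int_stack *s, int value)
{
    assert(s != NULL);
    if (s->size == STACK_CAPACITY)
        return false;
    s->items[s->size++] = value;
    return true;
}

/* POP: remove the top element and shrink the stack by one;
 * returns false if the stack is empty. */
static bool stack_pop(struct int_stack *s, int *out)
{
    assert(s != NULL && out != NULL);
    if (s->size == 0)
        return false;
    *out = s->items[--s->size];
    return true;
}
```

Pushing 1, 2, 3 and then popping three times returns 3, 2, 1: the element pushed last comes off first.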
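II. Analyzing a segmentation fault in a kernel driver
A segmentation fault is usually caused by a bad pointer address. It is very common in C code and is a serious bug that can be hard to track down. This section walks through resolving a segmentation fault in a driver. More complex cases follow much the same process, combined with printk output, GDB and other tools. Debugging is tedious and painful, yet most programmers keep coming back to it with a love-hate devotion.
1. Here is the fault output. Start by reading it.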
Unable to handle kernel NULL pointer dereference at virtual address 00000000
// the kernel dereferenced a NULL pointer
pgd = 7fc41725
[00000000] *pgd=337a9831, *pte=00000000, *ppte=00000000
Internal error: Oops: 17 [#1] ARM
Modules linked in: buttons(O)
CPU: 0 PID: 856 Comm: buttons_test Tainted: G O 4.19.8 #9
Hardware name: SMDK2440
PC is at sixth_drv_open+0x58/0x17c [buttons]
// PC is the address of the instruction that faulted
// Often the PC line gives only a raw address, without naming the function it belongs to
LR is at sixth_drv_open+0x4c/0x17c [buttons]
pc : [<bf0002dc>] lr : [<bf0002d0>] psr: 60000013
sp : c35abdf0 ip : 00000000 fp : 00000000
r10: c35abf70 r9 : c353f8c0 r8 : c3724a70
r7 : c0675008 r6 : 00000000 r5 : c36586c0 r4 : bf000ae8
r3 : 00000000 r2 : 347b7ccd r1 : 00000000 r0 : 00000000
// register values at the moment the faulting instruction executed
Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
Control: c000717f Table: 33598000 DAC: 00000051
Process buttons_test (pid: 856, stack limit = 0xaced0220)
// the process running when the fault occurred was buttons_test
Stack: (0xc35abdf0 to 0xc35ac000) // stack contents
bde0: bf0009cc bf000af0 bf000b20 c00c8a60
be00: 00000000 347b7ccd 00000002 c353f8c0 c3724a70 c353f8c8 c00c89b0 c304e770
be20: c353f8c0 c00c09f0 c35abec0 00000000 00000002 00000000 c304e770 c00d2214
be40: 00000000 33ff418f 00000000 00000000 00000006 00000041 c36e8858 00000054
be60: 00000000 c00a1408 c35ad540 c35aa000 c35ad490 c35ad600 00000054 00000002
be80: c3724a70 c340fd30 c31df220 c359adb8 00000000 347b7ccd 00000000 00000003
bea0: c35abf70 c0675008 00000001 fffff000 c35aa000 00000000 bec32d54 c00d3efc
bec0: c340fd30 c31df220 1c9e12f6 00000007 c35e0015 0000001c 00000000 c304a198
bee0: c3724a70 00000101 00000002 00000036 00000000 00000000 00000000 c35abf00
bf00: c36e8820 b6f13000 00100877 00000000 b6f13000 00100875 00000000 c35ad7e0
bf20: b6f14000 347b7ccd 00000003 c365d300 c35e0000 00000000 00000000 00000002
bf40: ffffff9c c00e0bf0 00000002 347b7ccd ffffff9c 00000003 c0675008 ffffff9c
bf60: c35e0000 c00c1ef8 00000005 b6f13000 00000002 c00a0000 00000006 00000100
bf80: 00000001 347b7ccd 00008588 00000000 000083d4 00000005 c00091e4 c35aa000
bfa0: 00000000 c0009000 00008588 00000000 00008618 00000002 bec32eac 00000000
bfc0: 00008588 00000000 000083d4 00000005 00000000 00000000 b6f14000 bec32d54
bfe0: 00000000 bec32d38 000084fc b6e7a35c 60000010 00008618 00000000 00000000
[<bf0002dc>] (sixth_drv_open [buttons]) from [<c00c8a60>] (chrdev_open+0xb0/0x170)
[<c00c8a60>] (chrdev_open) from [<c00c09f0>] (do_dentry_open+0x1dc/0x380)
[<c00c09f0>] (do_dentry_open) from [<c00d2214>] (path_openat+0x504/0xf4c)
[<c00d2214>] (path_openat) from [<c00d3efc>] (do_filp_open+0x6c/0xe0)
[<c00d3efc>] (do_filp_open) from [<c00c1ef8>] (do_sys_open+0x128/0x1f4)
[<c00c1ef8>] (do_sys_open) from [<c0009000>] (ret_fast_syscall+0x0/0x50)
// (backtrace)
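The PC line above packs three facts into "sixth_drv_open+0x58/0x17c": the function name, the offset of the faulting instruction inside it, and the size of the function. The sketch below splits such a string and derives the start address of the function from the PC. The struct oops_location and both helper functions are hypothetical names for this example, and the format string assumes exactly the "name+0xOFF/0xSIZE" layout seen in this oops.

```c
#include <assert.h>
#include <stdio.h>

struct oops_location {
    char symbol[64];        /* function name, e.g. "sixth_drv_open" */
    unsigned long offset;   /* bytes from the start of the function */
    unsigned long size;     /* total size of the function in bytes  */
};

/* Parse "name+0xOFF/0xSIZE"; returns 0 on success, -1 on malformed input. */
static int parse_oops_location(const char *text, struct oops_location *loc)
{
    assert(text != NULL && loc != NULL);
    if (sscanf(text, "%63[^+]+%lx/%lx",
               loc->symbol, &loc->offset, &loc->size) != 3)
        return -1;
    return 0;
}

/* Runtime start address of the function that contains pc. */
static unsigned long function_start(unsigned long pc,
                                    const struct oops_location *loc)
{
    assert(loc != NULL);
    return pc - loc->offset;
}
```

With the values from this oops, function_start(0xbf0002dc, &loc) yields 0xbf000284.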
2. Locating the PC
From the oops output, PC = 0xbf0002dc.
a. List the symbol addresses of the kernel and the loaded modules
cat /proc/kallsyms >kallsyms.dis
Search for addresses near PC = 0xbf0002dc to identify the faulting module. My results are below; the fault is clearly in the buttons driver.
bf000000 t $a [buttons]
bf000000 t sixth_drv_poll [buttons]
bf000040 t $d [buttons]
bf000048 t $a [buttons]
bf000164 t $d [buttons]
bf000174 t $a [buttons]
bf0001ac t $d [buttons]
bf0001b4 t $a [buttons]
bf0001b4 t buttons_remove [buttons]
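The manual search through kallsyms.dis amounts to finding the last symbol whose start address does not exceed the PC. A minimal sketch of that lookup follows. The struct ksym and ksym_lookup are hypothetical names. The table holds only a few entries: two copied from the listing above, sixth_drv_open derived as the PC minus the 0x58 offset printed in the oops, and two kernel functions whose start addresses appear in the kernel disassembly later in this article. A real kallsyms file is far longer.

```c
#include <assert.h>
#include <stddef.h>

struct ksym {
    unsigned long addr;   /* start address as printed in /proc/kallsyms */
    const char *name;
    const char *module;   /* NULL for symbols built into the kernel */
};

/* Sample table, sorted by ascending address (see the assumptions above). */
static const struct ksym sample_table[] = {
    { 0xbf000000UL, "sixth_drv_poll", "buttons" },
    { 0xbf0001b4UL, "buttons_remove", "buttons" },
    { 0xbf000284UL, "sixth_drv_open", "buttons" },  /* derived: PC - 0x58 */
    { 0xc00c0814UL, "do_dentry_open", NULL },
    { 0xc00c89b0UL, "chrdev_open",    NULL },
};

/* Return the entry with the highest start address that is still <= pc,
 * or NULL if pc lies below every entry. The table must be sorted. */
static const struct ksym *ksym_lookup(const struct ksym *table, size_t count,
                                      unsigned long pc)
{
    const struct ksym *best = NULL;

    assert(table != NULL);
    for (size_t i = 0; i < count && table[i].addr <= pc; i++)
        best = &table[i];
    return best;
}
```

Looking up 0xbf0002dc returns the sixth_drv_open entry with module "buttons", while 0xc00c8a60 returns chrdev_open with a NULL module, meaning the address belongs to the kernel itself.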
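b. Disassemble the faulting module
If the address lies in the kernel, disassemble the kernel; if it lies in a loaded module, disassemble that module. To decide whether the PC belongs to the kernel itself, look in System.map in the root of the kernel tree for addresses near the PC. If there are any, the fault is in the kernel proper, so disassemble the kernel; otherwise disassemble the module.
Here the fault is in a module, so disassemble the module: arm-linux-objdump -D buttons.ko >buttons.dis
c. Analyze the disassembly
PC = 0xbf0002dc, which the oops reports as sixth_drv_open+0x58, that is, offset 0x58 into the function. In the disassembly of buttons, look for the function that contains 0x000002dc. The following function is found: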
00000284 <sixth_drv_open>:
284: e92d4010 push {r4, lr}
288: e5913020 ldr r3, [r1, #32]
28c: e24dd008 sub sp, sp, #8
290: e3130b02 tst r3, #2048 ; 0x800
294: e59f012c ldr r0, [pc, #300] ; 3c8 <sixth_drv_open+0x144>
298: 0a000034 beq 370 <sixth_drv_open+0xec>
29c: ebfffffe bl 0 <down_trylock>
2a0: e3500000 cmp r0, #0
2a4: 1a000033 bne 378 <sixth_drv_open+0xf4>
2a8: e3a03000 mov r3, #0
2ac: e59f4118 ldr r4, [pc, #280] ; 3cc <sixth_drv_open+0x148>
2b0: e59f1118 ldr r1, [pc, #280] ; 3d0 <sixth_drv_open+0x14c>
2b4: e2842008 add r2, r4, #8
2b8: e58d2004 str r2, [sp, #4]
2bc: e58d1000 str r1, [sp]
2c0: e5940010 ldr r0, [r4, #16]
2c4: e1a02003 mov r2, r3
2c8: e59f1104 ldr r1, [pc, #260] ; 3d4 <sixth_drv_open+0x150>
2cc: ebfffffe bl 0 <request_threaded_irq>
2d0: e3500000 cmp r0, #0
2d4: 1a000036 bne 3b4 <sixth_drv_open+0x130>
2d8: e3a01000 mov r1, #0
2dc: e5912000 ldr r2, [r1]
2e0: e59fe0f0 ldr lr, [pc, #240] ; 3d8 <sixth_drv_open+0x154>
2e4: e59fc0f0 ldr ip, [pc, #240] ; 3dc <sixth_drv_open+0x158>
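The instruction at 0x000002dc is "2dc: e5912000 ldr r2, [r1]", which loads a word from the address held in r1. The instruction just before it sets r1 to 0, and the register dump confirms r1 = 00000000, so the load reads address 0. That is the NULL pointer dereference the oops reports. Every case needs its own analysis: read the assembly before and after the faulting instruction to infer the offending statement, then add printk output to pin it down precisely. This case is simple, and the source of sixth_drv_open in the buttons driver shows the cause at a glance: the pointer val is dereferenced without ever being initialized.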
static int sixth_drv_open(struct inode *inode, struct file *file)
{
    int ret;
    unsigned int *val;    /* never initialized */

    if (file->f_flags & O_NONBLOCK) {
        if (down_trylock(&button_lock))
            return -EBUSY;
    } else {
        /* acquire the semaphore */
        down(&button_lock);
    }

    /* configure GPF0 and GPF2 as input pins */
    /* configure GPG3 and GPG11 as input pins */
    ret = request_irq(pins_desc[0].irq, buttons_irq, 0, "S2", &pins_desc[0]);
    if (ret) {
        printk("request_irq %d for EINT0 err : %d!\n", pins_desc[0].irq, ret);
        //return ret;
    }

    /* BUG: dereferences the uninitialized pointer val; this is the faulting statement */
    *val = *val + 0x0333;

    ret = request_irq(pins_desc[1].irq, buttons_irq, 0, "S3", &pins_desc[1]);
    if (ret) {
        printk("request_irq for EINT2 err : %d!\n", ret);
        //return ret;
    }
    ret = request_irq(pins_desc[2].irq, buttons_irq, 0, "S4", &pins_desc[2]);
    if (ret) {
        printk("request_irq for EINT11 err : %d!\n", ret);
        //return ret;
    }
    ret = request_irq(pins_desc[3].irq, buttons_irq, 0, "S5", &pins_desc[3]);
    if (ret) {
        printk("request_irq for EINT19 err : %d!\n", ret);
        //return ret;
    }
    return 0;
}
III. Analyzing the stack backtrace
Take the register and stack dump from the oops:
pc : [<bf0002dc>] lr : [<bf0002d0>] psr: 60000013
sp : c35abdf0 ip : 00000000 fp : 00000000
r10: c35abf70 r9 : c353f8c0 r8 : c3724a70
r7 : c0675008 r6 : 00000000 r5 : c36586c0 r4 : bf000ae8
r3 : 00000000 r2 : 347b7ccd r1 : 00000000 r0 : 00000000
Stack: (0xc35abdf0 to 0xc35ac000) // stack contents
bde0: bf0009cc bf000af0 bf000b20 c00c8a60
be00: 00000000 347b7ccd 00000002 c353f8c0 c3724a70 c353f8c8 c00c89b0 c304e770
be20: c353f8c0 c00c09f0 c35abec0 00000000 00000002 00000000 c304e770 c00d2214
be40: 00000000 33ff418f 00000000 00000000 00000006 00000041 c36e8858 00000054
be60: 00000000 c00a1408 c35ad540 c35aa000 c35ad490 c35ad600 00000054 00000002
be80: c3724a70 c340fd30 c31df220 c359adb8 00000000 347b7ccd 00000000 00000003
bea0: c35abf70 c0675008 00000001 fffff000 c35aa000 00000000 bec32d54 c00d3efc
bec0: c340fd30 c31df220 1c9e12f6 00000007 c35e0015 0000001c 00000000 c304a198
bee0: c3724a70 00000101 00000002 00000036 00000000 00000000 00000000 c35abf00
bf00: c36e8820 b6f13000 00100877 00000000 b6f13000 00100875 00000000 c35ad7e0
bf20: b6f14000 347b7ccd 00000003 c365d300 c35e0000 00000000 00000000 00000002
bf40: ffffff9c c00e0bf0 00000002 347b7ccd ffffff9c 00000003 c0675008 ffffff9c
bf60: c35e0000 c00c1ef8 00000005 b6f13000 00000002 c00a0000 00000006 00000100
bf80: 00000001 347b7ccd 00008588 00000000 000083d4 00000005 c00091e4 c35aa000
bfa0: 00000000 c0009000 00008588 00000000 00008618 00000002 bec32eac 00000000
bfc0: 00008588 00000000 000083d4 00000005 00000000 00000000 b6f14000 bec32d54
bfe0: 00000000 bec32d38 000084fc b6e7a35c 60000010 00008618 00000000 00000000
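Each function gets its own stack frame while it runs. The frame holds the return address, local variables and so on.
In section II, the PC pinned the fault down to sixth_drv_open. So how did execution reach that function?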
00000284 <sixth_drv_open>:
284: e92d4010 push {r4, lr}
288: e5913020 ldr r3, [r1, #32]
28c: e24dd008 sub sp, sp, #8
290: e3130b02 tst r3, #2048 ; 0x800
294: e59f012c ldr r0, [pc, #300] ; 3c8 <sixth_drv_open+0x144>
298: 0a000034 beq 370 <sixth_drv_open+0xec>
29c: ebfffffe bl 0 <down_trylock>
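The disassembly in buttons.dis starts with a push instruction, which saves r4 and the return address (lr) on the stack, followed by a sub instruction that reserves 8 more bytes. The stack frame is therefore four 32-bit words. In ARM assembly, lr is register r14. push stores registers in register order, lowest-numbered register at the lowest address, so lr is the highest word of the frame. At the fault sp = c35abdf0, so the frame is the first four words of the stack dump: bf0009cc, bf000af0, bf000b20, c00c8a60. From this, lr = c00c8a60, which is the address sixth_drv_open returns to in its caller.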
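The frame-size arithmetic can be written down as a short sketch. It assumes a prologue of exactly one push followed by an optional "sub sp, sp, #n", as in the listing above, and that lr is the highest-numbered register pushed. The names pushed_regs, frame_bytes and saved_lr_addr are illustrative, not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Count the registers named in the low 16 bits of a "push {...}" word
 * (bit n set means register rn is pushed). */
static unsigned pushed_regs(uint32_t push_insn)
{
    unsigned n = 0;

    for (uint32_t list = push_insn & 0xffffu; list != 0; list >>= 1)
        n += list & 1u;
    return n;
}

/* Total frame size: 4 bytes per pushed register plus the "sub sp" amount. */
static uint32_t frame_bytes(uint32_t push_insn, uint32_t sub_bytes)
{
    return 4u * pushed_regs(push_insn) + sub_bytes;
}

/* Address of the saved lr, given sp after the prologue has run.
 * Requires lr (bit 14) in the list and pc (bit 15) absent, so that
 * lr occupies the highest word of the frame. */
static uint32_t saved_lr_addr(uint32_t sp, uint32_t push_insn,
                              uint32_t sub_bytes)
{
    assert((push_insn & 0xc000u) == 0x4000u);
    return sp + frame_bytes(push_insn, sub_bytes) - 4u;
}
```

For sixth_drv_open, frame_bytes(0xe92d4010, 8) is 16 and saved_lr_addr(0xc35abdf0, 0xe92d4010, 8) is 0xc35abdfc, the slot that holds c00c8a60 in the dump.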
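With the return address c00c8a60 known, find the function it belongs to. Searching kallsyms.dis for nearby addresses shows that c00c8a60 belongs to the kernel. Disassemble the kernel and search the output: the address lies in chrdev_open, which called sixth_drv_open. This gives chrdev_open->sixth_drv_open.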
c00c89b0 <chrdev_open>:
c00c89b0: e92d43f0 push {r4, r5, r6, r7, r8, r9, lr}
c00c89b4: e59f715c ldr r7, [pc, #348] ; c00c8b18 <chrdev_open+0x168>
c00c89b8: e24dd00c sub sp, sp, #12
c00c89bc: e5973000 ldr r3, [r7]
c00c89c0: e1a08000 mov r8, r0
c00c89c4: e1a09001 mov r9, r1
c00c89c8: e58d3004 str r3, [sp, #4]
c00c89cc: e5905130 ldr r5, [r0, #304] ; 0x130
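Reading this code, chrdev_open pushes seven registers and reserves 12 more bytes, so its frame is the next ten words of the dump, and the last of them gives lr = c00c09f0. Searching the kernel disassembly for that address gives do_dentry_open->chrdev_open->sixth_drv_open.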
c00c0814 <do_dentry_open>:
c00c0814: e92d41f0 push {r4, r5, r6, r7, r8, lr}
c00c0818: e1a04000 mov r4, r0
c00c081c: e1a05001 mov r5, r1
c00c0820: e2806008 add r6, r0, #8
c00c0824: e1a00006 mov r0, r6
c00c0828: e1a07002 mov r7, r2
c00c082c: eb003492 bl c00cda7c <path_get>
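Reading this code gives lr = c00d2214. Search the kernel disassembly again, and continue the same way up the call chain.
The result of this analysis necessarily matches the backtrace printed at the end of the oops: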
[<bf0002dc>] (sixth_drv_open [buttons]) from [<c00c8a60>] (chrdev_open+0xb0/0x170)
[<c00c8a60>] (chrdev_open) from [<c00c09f0>] (do_dentry_open+0x1dc/0x380)
[<c00c09f0>] (do_dentry_open) from [<c00d2214>] (path_openat+0x504/0xf4c)
[<c00d2214>] (path_openat) from [<c00d3efc>] (do_filp_open+0x6c/0xe0)
[<c00d3efc>] (do_filp_open) from [<c00c1ef8>] (do_sys_open+0x128/0x1f4)
[<c00c1ef8>] (do_sys_open) from [<c0009000>] (ret_fast_syscall+0x0/0x50)
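The manual unwinding above can be replayed as a small sketch over the first 20 words of the stack dump. The frame sizes (4, 10 and 6 words) are taken from the three prologues shown earlier. The function walk_frames and the array names are made up for this example, and the approach only works when the frame size of each function is known from its disassembly and stays fixed up to the call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stack words starting at sp = 0xc35abdf0, copied from the oops. */
static const uint32_t stack_words[] = {
    0xbf0009cc, 0xbf000af0, 0xbf000b20, 0xc00c8a60,
    0x00000000, 0x347b7ccd, 0x00000002, 0xc353f8c0,
    0xc3724a70, 0xc353f8c8, 0xc00c89b0, 0xc304e770,
    0xc353f8c0, 0xc00c09f0, 0xc35abec0, 0x00000000,
    0x00000002, 0x00000000, 0xc304e770, 0xc00d2214,
};

/* Frame sizes in 32-bit words, innermost function first:
 * sixth_drv_open, chrdev_open, do_dentry_open. */
static const size_t frame_words[] = { 4, 10, 6 };

/* Store the saved lr of each frame (its highest word) in lr_out.
 * Returns how many frames fit inside the supplied stack words. */
static size_t walk_frames(const uint32_t *stack, size_t stack_len,
                          const size_t *frames, size_t nframes,
                          uint32_t *lr_out)
{
    size_t pos = 0, walked = 0;

    assert(stack != NULL && frames != NULL && lr_out != NULL);
    for (size_t i = 0; i < nframes; i++) {
        if (frames[i] == 0 || pos + frames[i] > stack_len)
            break;                      /* frame would run past the dump */
        pos += frames[i];
        lr_out[walked++] = stack[pos - 1];
    }
    return walked;
}
```

Run over these arrays, the walk produces c00c8a60, c00c09f0 and c00d2214, the same three return addresses that open the backtrace.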