Tracking down a kernel oops
While porting I-pipe to the 4.19 kernel, the kernel panicked during boot with the following output:
[ 3.055412][ 0] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 3.064620][ 0] Mem abort info:
[ 3.067837][ 0] ESR = 0x86000005
[ 3.071314][ 0] Exception class = IABT (current EL), IL = 32 bits
[ 3.077656][ 0] SET = 0, FnV = 0
[ 3.081133][ 0] EA = 0, S1PTW = 0
[ 3.084696][ 0] [0000000000000000] user address but active_mm is swapper
[ 3.091474][ 0] Internal error: Oops: 86000005 [#1] SMP
[ 3.096775][ 0] Modules linked in:
[ 3.100253][ 0] Process swapper/0 (pid: 0, stack limit = 0x(____ptrval____))
[ 3.107379][ 0] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 4.19.90-21.xeno+ #1
[ 3.115978][ 0] Hardware name: Lenovo 90NDZ57MCN/FUSHUN, BIOS M3QKT0AA 05/29/20 20:00:44
[ 3.124144][ 0] I-pipe domain: Linux
[ 3.127795][ 0] pstate: 60000085 (nZCv daIf -PAN -UAO)
[ 3.133008][ 0] pc : (null)
[ 3.136750][ 0] lr : __ipipe_ack_fasteoi_irq+0x28/0x38
[ 3.141963][ 0] sp : ffff8020ffe852f0
[ 3.145701][ 0] x29: ffff8020ffe852f0 x28: ffff00000948d000
[ 3.151436][ 0] x27: 00000000f153c674 x26: ffff8020ffe853c0
[ 3.157170][ 0] x25: 0000000000000000 x24: ffff0000094cb000
[ 3.162905][ 0] x23: 0000000000000000 x22: ffff8020fb734400
[ 3.168640][ 0] x21: ffff000009914f40 x20: 000000000000001e
[ 3.174375][ 0] x19: ffff8020fb734400 x18: 0000000000000001
[ 3.180109][ 0] x17: 00000000d3ef17b4 x16: 000000005941f902
[ 3.185844][ 0] x15: ffff80217e1dfb3f x14: 0000000000000000
[ 3.191579][ 0] x13: 0000000000000000 x12: ffff7fe0083faf80
[ 3.197313][ 0] x11: 0000000000000000 x10: 0000000000000ad0
[ 3.203048][ 0] x9 : ffff000009453e40 x8 : 0000000000000040
[ 3.208783][ 0] x7 : 0000000000000000 x6 : ffff8020ffd27d78
[ 3.214517][ 0] x5 : ffff8020fb734400 x4 : ffff8020fb734400
[ 3.220252][ 0] x3 : 0000000000000000 x2 : ffff0000081dd128
[ 3.225986][ 0] x1 : 0000000000000000 x0 : ffff8020fb734518
[ 3.231721][ 0] Call trace:
[ 3.234591][ 0] (null)
[ 3.237983][ 0] __ipipe_dispatch_irq+0x84/0x1d8
[ 3.242676][ 0] __ipipe_grab_irq+0x4c/0xc0
[ 3.246934][ 0] gic_handle_irq+0xd0/0x140
[ 3.251106][ 0] handle_arch_irq_pipelined+0x28/0x80
[ 3.256145][ 0] el1_irq+0xc8/0x180
[ 3.259709][ 0] ipipe_unstall_root+0x3c/0x58
[ 3.264141][ 0] arch_cpu_idle+0x48/0x1b8
[ 3.268228][ 0] default_idle_call+0x44/0x4c
[ 3.272573][ 0] do_idle+0x19c/0x258
[ 3.276223][ 0] cpu_startup_entry+0x28/0x30
[ 3.280569][ 0] rest_init+0xb8/0xc8
[ 3.284221][ 0] start_kernel+0x474/0x48c
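Following the call trace, start with __ipipe_dispatch_irq (kernel/ipipe/core.c). The steps are:
1. Add the -g option to KBUILD_CFLAGS in the Makefile so that the disassembly can be matched against the original C source.
2. Rebuild the kernel.
3. Change into the kernel/ipipe directory and run: objdump -D -S core.o > core.dis
4. Find __ipipe_dispatch_irq in core.dis: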
void __ipipe_dispatch_irq(unsigned int irq, int flags) /* hw interrupts off */
{
2068: a9bb7bfd stp x29, x30, [sp,#-80]!
206c: 910003fd mov x29, sp
2070: a90153f3 stp x19, x20, [sp,#16]
2074: a9025bf5 stp x21, x22, [sp,#32]
2078: a90363f7 stp x23, x24, [sp,#48]
207c: f90023f9 str x25, [sp,#64]
2080: 2a0003f4 mov w20, w0
2084: aa1e03e0 mov x0, x30
2088: 2a0103f7 mov w23, w1
208c: 94000000 bl 0 <_mcount>
#endif
/*
* CAUTION: on some archs, virtual IRQs may have acknowledge
* handlers. Multiplex IRQs should have one too.
*/
if (unlikely(irq >= IPIPE_NR_XIRQS)) {
2090: 710ffe9f cmp w20, #0x3ff
2094: 54000d68 b.hi 2240 <__ipipe_dispatch_irq+0x1d8>
desc = NULL;
........
20dc: 37000060 tbnz w0, #0, 20e8 <__ipipe_dispatch_irq+0x80>
ipd = ipipe_root_domain;
20e0: 90000015 adrp x21, 0 <ipipe_stall_root>
20e4: 910002b5 add x21, x21, #0x0
if (ipd->irqs[irq].ackfn)
20e8: 8b131ab5 add x21, x21, x19, lsl #6
20ec: f94026a1 ldr x1, [x21,#72]
20f0: b4000061 cbz x1, 20fc <__ipipe_dispatch_irq+0x94>
ipd->irqs[irq].ackfn(desc);
20f4: aa1603e0 mov x0, x22
20f8: d63f0020 blr x1
if (chained_irq) {
20fc: 350008b9 cbnz w25, 2210 <__ipipe_dispatch_irq+0x1a8>
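In the call-trace entry __ipipe_dispatch_irq+0x84/0x1d8, 0x1d8 is the length of the function and 0x84 is the offset of the offending code within it. The function starts at 0x2068 in core.dis, so the code in question is at roughly 0x2068 + 0x84 = 0x20ec, which belongs to the source line "ipd->irqs[irq].ackfn()".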
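Given the assignment "desc->ipipe_ack = __ipipe_ack_fasteoi_irq" (./kernel/irq/chip.c), we can conclude that ipd->irqs[irq].ackfn is __ipipe_ack_fasteoi_irq.
The line "lr : __ipipe_ack_fasteoi_irq+0x28/0x38" confirms this: lr holds the return address, so the branch to address 0 (pc is null) was made from inside __ipipe_ack_fasteoi_irq.
Now look at __ipipe_ack_fasteoi_irq: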
void __ipipe_ack_fasteoi_irq(struct irq_desc *desc)
{
    desc->irq_data.chip->irq_hold(&desc->irq_data);
}
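The function pointer is called without being checked first. Most likely some GIC drivers were not covered by the ipipe patch during the port, which leaves irq_hold NULL for them.
Fix: add a guard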
 void __ipipe_ack_fasteoi_irq(struct irq_desc *desc)
 {
-    desc->irq_data.chip->irq_hold(&desc->irq_data);
+    if (desc && desc->irq_data.chip && desc->irq_data.chip->irq_hold)
+        desc->irq_data.chip->irq_hold(&desc->irq_data);
 }
With this change the system boots normally.