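When running insmod xxx.ko, we often hit a system crash that dumps a pile of oops output. Below I use a case I ran into myself to show how to track down this kind of problem. Don't panic when it happens: calm down and work through it step by step, just like solving a math problem back in school, one step of reasoning at a time.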
Finding the faulting line of code from the oops output
My case: the oops output is as follows
[ 40.389877] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 40.440158] Mem abort info:
[ 40.442967] ESR = 0x96000005
[ 40.446034] EC = 0x25: DABT (current EL), IL = 32 bits
[ 40.451357] SET = 0, FnV = 0
[ 40.454425] EA = 0, S1PTW = 0
[ 40.457579] Data abort info:
[ 40.460457] ISV = 0, ISS = 0x00000005
[ 40.464306] CM = 0, WnR = 0
[ 40.467271] user pgtable: 4k pages, 39-bit VAs, pgdp=00000001065c5000
[ 40.473716] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
[ 40.482423] Internal error: Oops: 96000005 [#1] SMP
[ 40.487297] Modules linked in: uplatdrv(O+) iopin_complex(O)
[ 40.492964] CPU: 3 PID: 810 Comm: insmod Tainted: G O 5.10.110 #105
[ 40.500525] Hardware name: Rockchip RK3588 EVB4 LP4 V10 Board (DT)
[ 40.506698] pstate: 60400009 (nZCv daif +PAN -UAO -TCO BTYPE=--)
[ 40.512732] pc : free_dma_rx_desc_resources+0x54/0xf0 [uplatdrv]
[ 40.518758] lr : stmmac_release+0x2c/0x70 [uplatdrv]
[ 40.523716] sp : ffffffc013ed3810
[ 40.527027] x29: ffffffc013ed3810 x28: 0000000000000013
[ 40.532337] x27: ffffffc011f6f770 x26: ffffffc01011cde0
[ 40.537647] x25: 0000000000000006 x24: ffffff8100ee6410
[ 40.542956] x23: ffffff810266c480 x22: ffffff8105ff01c0
[ 40.548266] x21: ffffff8105ff0000 x20: ffffff8105ff00c0
[ 40.553575] x19: 0000000000000000 x18: 0000000000000001
[ 40.558885] x17: 0000000000000000 x16: 0000000000000000
[ 40.564194] x15: 0000000000000000 x14: ffffff81008244c0
[ 40.569504] x13: ffffffc2ecdf3000 x12: 0000000000000006
[ 40.574813] x11: ffffffc0118bd4c0 x10: 0000000000000000
[ 40.580123] x9 : ffffffc00919fbfc x8 : ffffffc0120e9200
[ 40.585432] x7 : 0000000000000018 x6 : ffffff81043f7000
[ 40.590741] x5 : ffffffc0120f6678 x4 : 0000000000000000
[ 40.596050] x3 : 0000000000000002 x2 : 0000000000000000
[ 40.601360] x1 : 0000000000000000 x0 : 0000000000000000
[ 40.606669] Call trace:
[ 40.609141] free_dma_rx_desc_resources+0x54/0xf0 [uplatdrv]
[ 40.614820] stmmac_release+0x2c/0x70 [uplatdrv]
[ 40.619454] rk_gmac_probe+0x720/0x820 [uplatdrv]
[ 40.624163] platform_drv_probe+0x60/0xb4
[ 40.628172] really_probe+0x110/0x514
[ 40.631833] driver_probe_device+0x7c/0x170
[ 40.636011] device_driver_attach+0xcc/0xd4
[ 40.640188] __driver_attach+0xd8/0x17c
[ 40.644027] bus_for_each_dev+0x7c/0xe0
[ 40.647867] driver_attach+0x30/0x40
[ 40.651442] bus_add_driver+0x12c/0x23c
[ 40.655282] driver_register+0x84/0x140
[ 40.659121] __platform_driver_register+0x54/0x60
[ 40.663850] rk_gmac_init_module+0x30/0x5c [uplatdrv]
[ 40.668927] drv_init+0x7c/0xf0 [uplatdrv]
[ 40.673022] do_one_initcall+0x68/0x290
[ 40.676863] do_init_module+0x50/0x25c
[ 40.680607] load_module+0x22dc/0x2a00
[ 40.684351] __do_sys_finit_module+0xd0/0x130
[ 40.688708] __arm64_sys_finit_module+0x2c/0x40
[ 40.693236] el0_svc_common.constprop.0+0xa4/0x2a0
[ 40.698026] do_el0_svc+0x78/0xa0
[ 40.701341] el0_svc+0x20/0x30
[ 40.704400] el0_sync_handler+0xe8/0xf0
[ 40.708239] el0_sync+0x1a0/0x1c0
[ 40.711550]
[ 40.711550] PC: 0xffffffc00919f4e8:
[ 40.716506] f2e8 f9400001 b4000121 f9406424 b40000e4 b9409002 52800003 52800021 7100005f
[ 40.724699] f308 1a9f07e2 d63f0080 f94ee280 52800013 29525c15 6b1302ff 54000180 f9402280
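This is clearly a NULL pointer dereference, a classic problem. When you run into it, the first thing to do is work out which variable in the code is NULL and where it gets accessed.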
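The "Mem abort info" lines in the log are just the ESR value split into its bit fields, so they can be cross-checked by hand. A minimal sketch, assuming the standard arm64 ESR_EL1 layout (EC in bits [31:26], IL in bit 25, ISS in bits [24:0]); the function name decode_esr is made up for illustration:

```shell
#!/usr/bin/env bash
# Decode the fields of an arm64 ESR value as printed in the oops.
# decode_esr is a hypothetical helper name, not a kernel or binutils tool.
decode_esr() {
    local esr=$(( $1 ))
    local ec=$(( (esr >> 26) & 0x3f ))   # exception class, bits [31:26]
    local il=$(( (esr >> 25) & 0x1 ))    # instruction length bit, bit 25
    local iss=$(( esr & 0x1ffffff ))     # instruction specific syndrome, bits [24:0]
    local wnr=$(( (iss >> 6) & 0x1 ))    # data aborts only: 1 = write, 0 = read
    local dfsc=$(( iss & 0x3f ))         # data aborts only: fault status code
    printf 'EC=0x%02x IL=%d ISS=0x%08x WnR=%d DFSC=0x%02x\n' \
        "$ec" "$il" "$iss" "$wnr" "$dfsc"
}

decode_esr 0x96000005
```

For the value in this log the result agrees with what the kernel printed: EC = 0x25 (data abort at the current EL), ISS = 0x00000005, WnR = 0 (the faulting access was a read). A DFSC of 0x05 is a level 1 translation fault, which fits the all-zero page table entries shown for address 0.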
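Disassemble with objdump, then use the fault location from the oops to find the offset address
Fault location: pc : free_dma_rx_desc_resources+0x54/0xf0 [uplatdrv] (0x54 means the fault happened 0x54 bytes past the start of the function, and 0xf0 is the size of the function)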
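To make the symbol+offset/size notation concrete, here is a small sketch that splits such a location string into its parts using only shell parameter expansion. The helper name parse_oops_location is hypothetical, and the sketch assumes the input has exactly the shape shown in the pc line above:

```shell
#!/usr/bin/env bash
# Split "symbol+0xOFF/0xSIZE [module]" into symbol, offset, size and module.
# parse_oops_location is a hypothetical helper name used for illustration.
parse_oops_location() {
    local loc=$1
    local sym=${loc%%+*}        # everything before the '+'
    local rest=${loc#*+}        # "0xOFF/0xSIZE [module]"
    local off=${rest%%/*}       # offset from the start of the function
    local size=${rest#*/}       # "0xSIZE [module]"
    size=${size%% *}            # drop the trailing " [module]" if present
    local mod=""
    case $loc in
        *\[*\]*) mod=${loc##*\[}; mod=${mod%%\]*} ;;   # text inside [ ]
    esac
    printf 'symbol=%s offset=%s size=%s module=%s\n' "$sym" "$off" "$size" "$mod"
}

parse_oops_location 'free_dma_rx_desc_resources+0x54/0xf0 [uplatdrv]'
```

The module name in brackets tells you which ko to disassemble; entries in the call trace without brackets belong to the kernel image itself.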
Disassembly command:
objdump -D bin/xxxx.ko > 1.txt
Look through 1.txt: the free_dma_rx_desc_resources function starts at 0x00009494 in this ko, so free_dma_rx_desc_resources+0x54 corresponds to offset 0x000094e8 (0x9494 + 0x54).
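The address arithmetic is simple enough to do in the shell rather than by hand. A minimal sketch, with fault_address as a made-up helper name; the two inputs are the function start taken from the disassembly and the offset taken from the oops:

```shell
#!/usr/bin/env bash
# Add the offset from the oops to the function's start address in the ko.
# fault_address is a hypothetical helper name used for illustration.
fault_address() {
    printf '0x%x\n' $(( $1 + $2 ))
}

# Function start from 1.txt plus the offset from the pc line.
fault_address 0x9494 0x54
```

The result, 0x94e8, is the address to look up in the disassembly and to pass to addr2line.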
Use addr2line to pin down the exact line of code
addr2line -C -f -e xxxx.ko 0x94e8
stmmac/stmmac_main.c:429
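Note: the steps above only work if the ko file was compiled with debugging options enabled (addr2line needs the debug info to map an address to a source line).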
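One way to check this up front is to look for a .debug_info section in the module. This is only a sketch: it assumes binutils readelf is installed on the host, and has_debug_info is a made-up helper name:

```shell
#!/usr/bin/env bash
# Succeed (exit status 0) if the given ELF file has a .debug_info section.
# has_debug_info is a hypothetical helper name; readelf comes from binutils.
has_debug_info() {
    readelf -S "$1" 2>/dev/null | grep -q '\.debug_info'
}

# Example use against the module from this article:
#   has_debug_info xxxx.ko && echo "debug info present" || echo "no debug info"
```

If the section is missing, addr2line typically prints ??:0 instead of a file and line, and the module has to be rebuilt with debug info first.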