kernel hung问题定位分析
需求背景
之前的系统hung时,只能获取到线程的backtrace信息,及相关的线程内核态、用户态的总执行时间,只能定位到哪个应用线程触发了hung,无法再精确定位到范围。
实现方案分析
- 系统hung是输出线程的PC、LR、SP等指针信息;
- 获取线程在本轮hung检测的执行情况;
- 修改utime、stime、gtime的时间单位,改为ms,便于统计;
实现过程
//kernel/hung_task.c
static void debug_show_all_process(void)
{
struct task_struct *g, *p;
struct pt_regs *regs;
u64 utime, stime, gtime;
rcu_read_lock();
for_each_process_thread(g, p) {
pr_err("%s()-%d %20.20s/%04d on cpu%d policy:%u, prio:%d, state:0x%lx\n",
__func__, __LINE__,
p->comm, task_pid_nr(p), p->on_cpu,
p->policy, p->prio, p->state);
utime = p->utime;
stime = p->stime;
gtime = p->gtime;
do_div(utime, 1000000);
do_div(stime, 1000000);
do_div(gtime, 1000000);
pr_err("%s()-%d %20.20s/%04d utime:%016llu, stime:%016llu, gtime:%016llu\n",
__func__, __LINE__,
p->comm, task_pid_nr(p),
utime, stime, gtime);
pr_err("%s()-%d %20.20s/%04d lswc:%016ld, lswt:%016u\n",
__func__, __LINE__,
p->comm, task_pid_nr(p),
p->last_switch_count, jiffies_to_msecs(p->last_switch_time));
regs = task_pt_regs(p);
pr_err("%s()-%d %20.20s/%04d pc = 0x%08lx, lr = 0x%08lx, sp = 0x%08lx, cpsr = 0x%08lx\n",
__func__, __LINE__,
p->comm, task_pid_nr(p),
regs->ARM_pc, regs->ARM_lr, regs->ARM_sp, regs->ARM_cpsr);
show_stack(p, NULL);
}
rcu_read_unlock();
}
修改说明:
- utime、stime、gtime:分别表示线程在用户态、内核态、持锁的时间,单位为ms;
- lswc:last_switch_count,hung task检测状态的最后一次switch cnt,一般为 t->nvcsw + t->nivcsw,为0的时候,说明其在hung_task_timerout内是有在运行过的。
- lswt:last_switch_time,hung task检测状态时该task最后一次switch 时刻,单位为ms,为0的时候,说明其在hung_task_timerout内是有在运行过的**。**
**
测试验证
现象一:应用线程可获取到PC指针,但lswc、lswt均为0
| [66763.057978] debug_show_all_process()-193 run_queue/0536 on cpu1 policy:2, prio:1, state:0x0
[66763.087433] debug_show_all_process()-204 run_queue/0536 utime:0000000013 420 660, stime:0000000001 023 872, gtime:0000000000000000
[66763.123497] debug_show_all_process()-209 run_queue/0536 lswc:0000000000000000, lswt:0000000000000000
[66763.195417] debug_show_all_process()-215 run_queue/0536 pc = 0xb6a04432, lr = 0xb6a04433, sp = 0xa72fec68, cpsr = 0xa0070030
[66763.260615] [] (__schedule) from [] (gic_data+0x0/0x6ac) |
---|
现象二:内核线程无法获取到PC指针,但lswc、lswt不为0
[66729.112082] debug_show_all_process()-193 main_wq_hdlr/0160
on cpu0 policy:2, prio:0, state:0x2 [66729.151779]
debug_show_all_process()-204 main_wq_hdlr/0160
utime:0000000000000000, stime:0000000009088756, gtime:0000000000000000
[66729.221647] debug_show_all_process()-209 main_wq_hdlr/0160 lswc:0000000314519330, lswt:0000000066279424
[66729.249777] debug_show_all_process()-215 main_wq_hdlr/0160 pc = 0x00000000, lr = 0x00000000, sp = 0x00000000,
cpsr = 0x00000013 [66729.285897] [] (__schedule) from
[] (schedule+0xa8/0xcc) [66729.307128] []
(schedule) from [] (schedule_timeout+0x24/0x230)
[66729.333192] [] (schedule_timeout) from []
(wait_for_common+0xdc/0x174) [66729.422103] []
(wait_for_common) from [] (mmc_wait_for_req_done+0x20/0x100)
[66729.478555] [] (mmc_wait_for_req_done) from []
(mmc_io_rw_extended+0x1a4/0x28c) [66729.573982] []
(mmc_io_rw_extended) from []
(sdio_io_rw_ext_helper+0x17c/0x1ac) [66729.603626] []
(sdio_io_rw_ext_helper) from [] (sdio_readsb+0x14/0x1c)
[66729.629588] [] (sdio_readsb) from []
(bl_read_data_sync+0x5c/0xc4 [bl_fdrv]) [66729.657059] []
(bl_read_data_sync [bl_fdrv]) from []
(bl_get_interrupt_status+0x28/0xe0 [bl_fdrv]) [66729.727505]
[] (bl_get_interrupt_status [bl_fdrv]) from []
(bl_main_process+0xa4/0xa24 [bl_fdrv]) [66729.759613] []
(bl_main_process [bl_fdrv]) from [] (main_wq_hdlr+0x38/0x58
[bl_fdrv]) [66729.790273] [] (main_wq_hdlr [bl_fdrv]) from
[] (kthread+0x110/0x128) [66729.857374] []
(kthread) from [] (ret_from_fork+0x14/0x2c) [66729.879826]
Exception stack(0xc695bfb0 to 0xc695bff8) [66729.894834] bfa0:
00000000 00000000 00000000 00000000 [66729.921942] bfc0: 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000
[66729.949322] bfe0: 00000000 00000000 00000000 00000000 00000013
00000000
现象三:应用线程获取到PC指针,lswc、lswt不为0
[66746.915857] debug_show_all_process()-193 demo_rplidar/0429
on cpu0 policy:0, prio:120, state:0x1 [66747.027954]
debug_show_all_process()-204 demo_rplidar/0429
utime:0000000000097656, stime:0000000000271868, gtime:0000000000000000
[66747.207064] debug_show_all_process()-209 demo_rplidar/0429 lswc:0000000005016612, lswt:0000000016 375 808
[66747.370209] debug_show_all_process()-215 demo_rplidar/0429 pc = 0xb64c9ca6, lr = 0xb64c7095, sp = 0xb5919dd0,
cpsr = 0x00040030 [66747.559655] [] (schedule) from
[] (futex_wait_queue_me+0xfc/0x180) [66747.590411]
[] (futex_wait_queue_me) from []
(futex_wait+0xe8/0x204) [66747.615753] [] (futex_wait) from
[] (do_futex+0xe4/0xa40) [66747.689348] []
(do_futex) from [] (sys_futex+0x144/0x194) [66747.714307]
[] (sys_futex) from [] (ret_fast_syscall+0x0/0x54)
[66747.736860] Exception stack(0xc5815fa8 to 0xc5815ff0)
[66747.754790] 5fa0: 00000000 ffffffff 000a70b0
00000189 00000001 00000000 [66747.793418] 5fc0: 00000000 ffffffff
00000000 000000f0 000a70c7 000f4240 00000000 000a70b0 [66747.851050]
5fe0: 000000f0 b5919dd0 b64c7095 b64c9ca6
总结归纳
- 内核线程无法直接获取到PC等指针,但可获取到其backtrace,来找到对应的PC指针;
- 触发hung的task,其lswc、lswt均不为0;
- 可获取到线程的PC等指针信息;
- 无法获取到应用线程的backtrace,每个进程有独立的地址空间,CPU看到的虚拟地址,通过当前应用线程的PC指针无法获取到其backtrace;