kernel hung问题定位分析

本文详细描述了如何改进系统对kernelhung问题的定位,通过获取线程的PC、LR、SP指针,统计utime、stime和gtime,以及检查last_switch_count和last_switch_time,以便更精确地追踪导致挂起的线程。内核线程可以获取backtrace,应用线程则需特殊处理以获取PC指针和状态信息。
摘要由CSDN通过智能技术生成

kernel hung问题定位分析

需求背景

之前的系统hung时,只能获取到线程的backtrace信息,及相关的线程内核态、用户态的总执行时间,只能定位到哪个应用线程触发了hung,无法再精确定位到范围。

实现方案分析

  1. 系统hung是输出线程的PC、LR、SP等指针信息;
  2. 获取线程在本轮hung检测的执行情况;
  3. 修改utime、stime、gtime的时间单位,改为ms,便于统计;

实现过程

//kernel/hung_task.c
static void debug_show_all_process(void)
{
	struct task_struct *g, *p;
	struct pt_regs *regs;
	u64 utime, stime, gtime;

	rcu_read_lock();
	for_each_process_thread(g, p) {
		pr_err("%s()-%d %20.20s/%04d on cpu%d policy:%u, prio:%d, state:0x%lx\n",
			__func__, __LINE__,
			p->comm, task_pid_nr(p), p->on_cpu,
			p->policy, p->prio, p->state);

		utime = p->utime;
		stime = p->stime;
		gtime = p->gtime;
		do_div(utime, 1000000);
		do_div(stime, 1000000);
		do_div(gtime, 1000000);
		pr_err("%s()-%d %20.20s/%04d utime:%016llu, stime:%016llu, gtime:%016llu\n",
			__func__, __LINE__,
			p->comm, task_pid_nr(p),
			utime, stime, gtime);

		pr_err("%s()-%d %20.20s/%04d lswc:%016ld, lswt:%016u\n",
			__func__, __LINE__,
			p->comm, task_pid_nr(p),
			p->last_switch_count, jiffies_to_msecs(p->last_switch_time));

		regs = task_pt_regs(p);
		pr_err("%s()-%d %20.20s/%04d pc = 0x%08lx, lr = 0x%08lx, sp = 0x%08lx, cpsr = 0x%08lx\n",
			__func__, __LINE__,
			p->comm, task_pid_nr(p),
			regs->ARM_pc, regs->ARM_lr, regs->ARM_sp, regs->ARM_cpsr);

		show_stack(p, NULL);
	}
	
	rcu_read_unlock();
}

修改说明:

  1. utime、stime、gtime:分别表示线程在用户态、内核态、持锁的时间,单位为ms;
  2. lswc:last_switch_count,hung task检测状态的最后一次switch cnt,一般为 t->nvcsw + t->nivcsw,为0的时候,说明其在hung_task_timerout内是有在运行过的
  3. lswt:last_switch_time,hung task检测状态时该task最后一次switch 时刻,单位为ms,为0的时候,说明其在hung_task_timerout内是有在运行过的**。**
    **

测试验证

现象一:应用线程可获取到PC指针,但lswc、lswt均为0

| [66763.057978] debug_show_all_process()-193 run_queue/0536 on cpu1 policy:2, prio:1, state:0x0
[66763.087433] debug_show_all_process()-204 run_queue/0536 utime:0000000013 420 660, stime:0000000001 023 872, gtime:0000000000000000
[66763.123497] debug_show_all_process()-209 run_queue/0536 lswc:0000000000000000, lswt:0000000000000000
[66763.195417] debug_show_all_process()-215 run_queue/0536 pc = 0xb6a04432, lr = 0xb6a04433, sp = 0xa72fec68, cpsr = 0xa0070030

[66763.260615] [] (__schedule) from [] (gic_data+0x0/0x6ac)

现象二:内核线程无法获取到PC指针,但lswc、lswt不为0

[66729.112082] debug_show_all_process()-193 main_wq_hdlr/0160
on cpu0 policy:2, prio:0, state:0x2 [66729.151779]
debug_show_all_process()-204 main_wq_hdlr/0160
utime:0000000000000000, stime:0000000009088756, gtime:0000000000000000
[66729.221647] debug_show_all_process()-209 main_wq_hdlr/0160 lswc:0000000314519330, lswt:0000000066279424
[66729.249777] debug_show_all_process()-215 main_wq_hdlr/0160 pc = 0x00000000, lr = 0x00000000, sp = 0x00000000,
cpsr = 0x00000013
[66729.285897] [] (__schedule) from
[] (schedule+0xa8/0xcc) [66729.307128] []
(schedule) from [] (schedule_timeout+0x24/0x230)
[66729.333192] [] (schedule_timeout) from []
(wait_for_common+0xdc/0x174) [66729.422103] []
(wait_for_common) from [] (mmc_wait_for_req_done+0x20/0x100)
[66729.478555] [] (mmc_wait_for_req_done) from []
(mmc_io_rw_extended+0x1a4/0x28c) [66729.573982] []
(mmc_io_rw_extended) from []
(sdio_io_rw_ext_helper+0x17c/0x1ac) [66729.603626] []
(sdio_io_rw_ext_helper) from [] (sdio_readsb+0x14/0x1c)
[66729.629588] [] (sdio_readsb) from []
(bl_read_data_sync+0x5c/0xc4 [bl_fdrv]) [66729.657059] []
(bl_read_data_sync [bl_fdrv]) from []
(bl_get_interrupt_status+0x28/0xe0 [bl_fdrv]) [66729.727505]
[] (bl_get_interrupt_status [bl_fdrv]) from []
(bl_main_process+0xa4/0xa24 [bl_fdrv]) [66729.759613] []
(bl_main_process [bl_fdrv]) from [] (main_wq_hdlr+0x38/0x58
[bl_fdrv]) [66729.790273] [] (main_wq_hdlr [bl_fdrv]) from
[] (kthread+0x110/0x128) [66729.857374] []
(kthread) from [] (ret_from_fork+0x14/0x2c) [66729.879826]
Exception stack(0xc695bfb0 to 0xc695bff8) [66729.894834] bfa0:
00000000 00000000 00000000 00000000 [66729.921942] bfc0: 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000
[66729.949322] bfe0: 00000000 00000000 00000000 00000000 00000013
00000000

现象三:应用线程获取到PC指针,lswc、lswt不为0

[66746.915857] debug_show_all_process()-193 demo_rplidar/0429
on cpu0 policy:0, prio:120, state:0x1 [66747.027954]
debug_show_all_process()-204 demo_rplidar/0429
utime:0000000000097656, stime:0000000000271868, gtime:0000000000000000
[66747.207064] debug_show_all_process()-209 demo_rplidar/0429 lswc:0000000005016612, lswt:0000000016 375 808
[66747.370209] debug_show_all_process()-215 demo_rplidar/0429 pc = 0xb64c9ca6, lr = 0xb64c7095, sp = 0xb5919dd0,
cpsr = 0x00040030
[66747.559655] [] (schedule) from
[] (futex_wait_queue_me+0xfc/0x180) [66747.590411]
[] (futex_wait_queue_me) from []
(futex_wait+0xe8/0x204) [66747.615753] [] (futex_wait) from
[] (do_futex+0xe4/0xa40) [66747.689348] []
(do_futex) from [] (sys_futex+0x144/0x194) [66747.714307]
[] (sys_futex) from [] (ret_fast_syscall+0x0/0x54)
[66747.736860] Exception stack(0xc5815fa8 to 0xc5815ff0)
[66747.754790] 5fa0: 00000000 ffffffff 000a70b0
00000189 00000001 00000000 [66747.793418] 5fc0: 00000000 ffffffff
00000000 000000f0 000a70c7 000f4240 00000000 000a70b0 [66747.851050]
5fe0: 000000f0 b5919dd0 b64c7095 b64c9ca6

总结归纳

  1. 内核线程无法直接获取到PC等指针,但可获取到其backtrace,来找到对应的PC指针;
  2. 触发hung的task,其lswc、lswt均不为0;
  3. 可获取到线程的PC等指针信息;
  4. 无法获取到应用线程的backtrace,每个进程有独立的地址空间,CPU看到的虚拟地址,通过当前应用线程的PC指针无法获取到其backtrace;
  • 21
    点赞
  • 19
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值