现象:
待机时反复按power键休眠会有概率性重启,概率 >= 50%,但连接USB线情况下不会出现,所有怀疑是系统进入休眠才发生了异常,取出mtklog看到有db产生,确实发生了KE(kernel exception),取出db 和 vmlinux (必须是和当前软件是同一次编译的)后,使用GAT工具解开db,取出 SYS_MINI_RDUMP,使用 gdb 调试:
android@c0490:~/log/6601/20151218-KE$ arm-linux-androideabi-gdb vmlinux SYS_MINI_RDUMP
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "--host=x86_64-linux-gnu --target=arm-linux-android".
For bug reporting instructions, please see:
<http://source.android.com/source/report-bugs.html>...
Reading symbols from /home/android/log/6601/20151218-KE/vmlinux...done.
[New LWP 100]
[New LWP 1]
[New LWP 2]
[New LWP 3]
[New LWP 4]
[New LWP 5]
[New LWP 6]
[New LWP 7]
[New LWP 8]
Core was generated by `console=tty0 console=ttyMT0,921600n1 root=/dev/ram vmalloc=496M slub_max_order='.
#0 device_prepare (dev=<optimized out>, state=...) at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/drivers/base/power/main.c:1483
1483 callback = dev->bus->pm->prepare;
(gdb) list
1478 } else if (dev->class && dev->class->pm) {
1479 info = "preparing class ";
1480 callback = dev->class->pm->prepare;
1481 } else if (dev->bus && dev->bus->pm) {
1482 info = "preparing bus ";
1483 callback = dev->bus->pm->prepare;
1484 }
1485
1486 if (!callback && dev->driver && dev->driver->pm) {
1487 info = "preparing driver ";
(gdb) bt
#0 device_prepare (dev=<optimized out>, state=...) at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/drivers/base/power/main.c:1483
#1 dpm_prepare (state=...) at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/drivers/base/power/main.c:1520
#2 0xc0327cc8 in dpm_suspend_start (state=...) at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/drivers/base/power/main.c:1556
#3 0xc0090390 in suspend_devices_and_enter (state=0) at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/kernel/power/suspend.c:273
#4 0xc0090a50 in enter_state (state=3) at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/kernel/power/suspend.c:426
#5 0xc0090ac0 in pm_suspend (state=0) at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/kernel/power/suspend.c:464
#6 0xc0098b24 in try_to_suspend (work=<optimized out>) at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/kernel/power/autosleep.c:81
#7 0xc0042b60 in process_one_work (worker=0xd79a9600, work=0xc0e056e4 <suspend_work>)
at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/kernel/workqueue.c:2216
#8 0xc0043098 in worker_thread (__worker=0x0 <__vectors_start>) at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/kernel/workqueue.c:2348
#9 0xc004c0bc in kthread (_create=0xdb1ffdf0) at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/kernel/kthread.c:200
#10 0xc000f308 in ret_from_fork () at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/arch/arm/kernel/entry-common.S:91
#11 0xc000f308 in ret_from_fork () at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/arch/arm/kernel/entry-common.S:91
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)
看到这个KE调用栈,心情是比较压抑的o(╯□╰)o,感觉就不太爽,因为里面都是kernel代码错误,并没有直接给出可以定位到具体的客制化代码位置,这种问题明显是原本是A处埋的雷可却在B处炸了!
本来A处埋了地雷,如果在A处炸了,那么通常来说问题还比较容易定位和解决。但是结果却是在B处炸了,然而B和A之间距离有多远呢?从这个bt是难以定位出来的,
所以使用有限的debug线索实现从B抓到A的过程成了本次解题的关键之关键!!!
static int device_prepare(struct device *dev, pm_message_t state)
{
int (*callback)(struct device *) = NULL;
char *info = NULL;
int error = 0;
if (dev->power.syscore)
return 0;
/*
* If a device's parent goes into runtime suspend at the wrong time,
* it won't be possible to resume the device. To prevent this we
* block runtime suspend here, during the prepare phase, and allow
* it again during the complete phase.
*/
pm_runtime_get_noresume(dev);
device_lock(dev);
dev->power.wakeup_path = device_may_wakeup(dev);
if (dev->pm_domain) {
info = "preparing power domain ";
callback = dev->pm_domain->ops.prepare;
} else if (dev->type && dev->type->pm) {
info = "preparing type ";
callback = dev->type->pm->prepare;
} else if (dev->class && dev->class->pm) {
info = "preparing class ";
callback = dev->class->pm->prepare;
} else if (dev->bus && dev->bus->pm) {
info = "preparing bus ";
callback = dev->bus->pm->prepare;
}
从调用栈可以知道死机的具体kernel代码位置,但是为什么会导致死机呢?是空指针,还是野指针导致?这个就需要分析汇编指令了,下面分析发生死机的原因,
使用gdb输出崩溃时的关键汇编指令:
(gdb) disassemble
Dump of assembler code for function dpm_prepare:
...
0xc0327ba8 <+252>: mov r0, r8
0xc0327bac <+256>: bl 0xc0998e1c <mutex_unlock>
0xc0327bb0 <+260>: b 0xc0327ad4 <dpm_prepare+40>
0xc0327bb4 <+264>: ldr r3, [r4, #-128] ; 0x80
0xc0327bb8 <+268>: cmp r3, #0
0xc0327bbc <+272>: beq 0xc0327c18 <dpm_prepare+364>
0xc0327bc0 <+276>: ldr r3, [r3, #20]
0xc0327bc4 <+280>: cmp r3, #0
0xc0327bc8 <+284>: beq 0xc0327c18 <dpm_prepare+364>
=> 0xc0327bcc <+288>: ldr r9, [r3]
0xc0327bd0 <+292>: cmp r9, #0
0xc0327bd4 <+296>: bne 0xc0327b98 <dpm_prepare+236>
0xc0327bd8 <+300>: ldr r3, [r4, #-48] ; 0x30
0xc0327bdc <+304>: cmp r3, #0
0xc0327be0 <+308>: beq 0xc0327ba8 <dpm_prepare+252>
0xc0327be4 <+312>: ldr r3, [r3, #52] ; 0x34
---Type <return> to continue, or q <return> to quit---q
Quit
汇编指令打印出来有点长,但我们只看关键的部分,如上的 => 提示指令:
=> 0xc0327bcc <+288>: ldr r9, [r3]
gdb 工具已标记出进程是运行到这条 ldr(Memory Access) 指令时挂了!这个时候我们需要知道此刻的各个寄存器内容才能判断分析KE的原因:
(gdb) info reg
r0 0x0 0
r1 0x0 0
r2 0x0 0
r3 0x6b6b6b6b 1802201963
r4 0xd90680b0 3641082032
r5 0xd9068000 3641081856
r6 0xc0e1c280 3236020864
r7 0xc0e1c2e0 3236020960
r8 0xd9068034 3641081908
r9 0x0 0
r10 0xc0f07f3c 3236986684
r11 0xd7b83d1c 3619175708
r12 0xd7b83c88 3619175560
sp 0xd7b83cf0 0xd7b83cf0
lr 0xc099767c 3231282812
pc 0xc0327bcc 0xc0327bcc <dpm_prepare+288>
cpsr 0x200e0013 537788435
(gdb)
到现在这一步,死机的原因也明确了,那么关键是怎么根据现有的线索抓到最终导致问题的真凶呢?
基本思路就是:找出出错指针 dev->bus->pm 的地址 &
dev->bus->pm ,根据这个出错指针的地址来想办法抓到真凶.
现在需要打印出汇编-C混合代码供分析,下面这段代码很关键:
1481 } else if (dev->bus && dev->bus->pm) {
0xc0327c30 <+388>: ldr r3, [r4, #-52] ; 0x34
0xc0327c34 <+392>: cmp r3, #0
0xc0327c38 <+396>: beq 0xc0327bd8 <dpm_prepare+300>
0xc0327c3c <+400>: ldr r3, [r3, #52] ; 0x34
0xc0327c40 <+404>: cmp r3, #0
0xc0327c44 <+408>: bne 0xc0327bcc <dpm_prepare+288>
0xc0327c48 <+412>: b 0xc0327bd8 <dpm_prepare+300>
1482 info = "preparing bus ";
1483 callback = dev->bus->pm->prepare;
=> 0xc0327bcc <+288>: ldr r9, [r3]
dev->bus ==> r4 - #52 == 0xd906807c
dev->bus->pm ==> r4 == 0xd90680b0
也可以使用gdb直接打印出来验证分析是否正确:
(gdb) p dev
$2 = (struct device *) 0xd9068000
(gdb) p &dev->bus
$3 = (struct bus_type **) 0xd906807c
(gdb) p &dev->bus->pm
$4 = (const struct dev_pm_ops **) 0x34
我们看到&dev->bus-pm的值是个0x34,其实这个值就是pm成员在bus结构体中的偏移刚好就是0x34,所以上面的分析的正确的,接下来就要根据dev->bus->pm的地址找到具体是哪里给出来的了,我们需要用gdb分析调用栈分析dev是从哪里传过来的,如下frame 0 是指最顶层的栈帧,通常被损坏了无法打印出当时的变量值,需要看上一级frame 1的栈帧分析:
(gdb) f 0
#0 device_prepare (dev=<optimized out>, state=...) at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/drivers/base/power/main.c:1483
1483 callback = dev->bus->pm->prepare;
(gdb) p dev
$1 = <optimized out>
(gdb) f 1
#1 dpm_prepare (state=...) at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/drivers/base/power/main.c:1520
1520 error = device_prepare(dev, state);
(gdb) list
1515 struct device *dev = to_device(dpm_list.next);
1516
1517 get_device(dev);
1518 mutex_unlock(&dpm_list_mtx);
1519
1520 error = device_prepare(dev, state);
1521
1522 mutex_lock(&dpm_list_mtx);
1523 if (error) {
1524 if (error == -EAGAIN) {
(gdb) p dev
$2 = (struct device *) 0xd9068000
(gdb) p &dev->bus
$3 = (struct bus_type **) 0xd906807c
(gdb) p &dev->bus->pm
$4 = (const struct dev_pm_ops **) 0x34
(gdb) p dpm_list
$5 = {next = 0xd90680b0, prev = 0xdb1df5b0}
在frame 1栈帧里可以看到此刻的dev具体地址,我们发现dev是从链表 dpm_list得来的,而且此刻出问题的dpm_list.next ==
0xd90680b0!!!,发现这个值很眼熟,==》就是 r4的值,也是刚刚要找的&dev->bus->pm的值!为什么会这样呢?
下面分析源代码:
int dpm_prepare(pm_message_t state)
{
int error = 0;
might_sleep();
mutex_lock(&dpm_list_mtx);
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.next);
get_device(dev);
mutex_unlock(&dpm_list_mtx);
error = device_prepare(dev, state);
mutex_lock(&dpm_list_mtx);
if (error) {
if (error == -EAGAIN) {
put_device(dev);
error = 0;
continue;
}
printk(KERN_INFO "PM: Device %s not prepared "
"for power transition: code %d\n",
dev_name(dev), error);
put_device(dev);
break;
}
dev->power.is_prepared = true;
if (!list_empty(&dev->power.entry))
list_move_tail(&dev->power.entry, &dpm_prepared_list);
put_device(dev);
}
mutex_unlock(&dpm_list_mtx);
return error;
}
/**
* device_pm_add - Add a device to the PM core's list of active devices.
* @dev: Device to add to the list.
*/
void device_pm_add(struct device *dev)
{
pr_debug("PM: Adding info for %s:%s\n",
dev->bus ? dev->bus->name : "No Bus", dev_name(dev));
mutex_lock(&dpm_list_mtx);
if (dev->parent && dev->parent->power.is_prepared)
dev_warn(dev, "parent %s should not be sleeping\n",
dev_name(dev->parent));
list_add_tail(&dev->power.entry, &dpm_list);
mutex_unlock(&dpm_list_mtx);
}
到这一步,心情总算是好点了,感觉就要快抓到它了!
void device_pm_add(struct device *dev)
{
pr_debug("PM: Adding info for %s:%s\n",
dev->bus ? dev->bus->name : "No Bus", dev_name(dev));
pr_debug("[KE],[%s],[0x%lx],[0x%lx],[0x%lx],[0x%lx],[%s]\n",
__func__,
(long)dev,
(long)&dev->bus,
(long)&dev->bus->pm,
(long)&dev->power.entry,
dev_name(dev));</span>
mutex_lock(&dpm_list_mtx);
if (dev->parent && dev->parent->power.is_prepared)
dev_warn(dev, "parent %s should not be sleeping\n",
dev_name(dev->parent));
list_add_tail(&dev->power.entry, &dpm_list);
mutex_unlock(&dpm_list_mtx);
}
编译kernel、烧机、抓开机log :
[ 21.491480]<2>.(2)[1:swapper/0]PM: Adding info for No Bus:silead_fp_dev
[ 21.492349]<2>.(2)[1:swapper/0][KE],[device_pm_add],[0xd9068000],[0xd906807c],[0x34],[0xd90680b0],[silead_fp_dev]
看到这个结果,心情马上就好了^_^ ! 因为造成死机重启的真凶露面了!------>>>> silead_fp_dev
看到这个设备其实就可以去代码里面搜索关键字找到代码了,但是能否来得跟直接一点呢?能否直接定位到具体什么时候执行的呢?当然答案是肯定的,加上dump_stack() 函数就行了:
void device_pm_add(struct device *dev)
{
pr_debug("PM: Adding info for %s:%s\n",
dev->bus ? dev->bus->name : "No Bus", dev_name(dev));
pr_debug("[KE],[%s],[0x%lx],[0x%lx],[0x%lx],[0x%lx],[%s]\n",
__func__,
(long)dev,
(long)&dev->bus,
(long)&dev->bus->pm,
(long)&dev->power.entry,
dev_name(dev));
if(strcmp(dev_name(dev), "silead_fp_dev") == 0)
{
pr_debug("[KE]:[0x%lx]\n",(long)dev);
dump_stack();
}
mutex_lock(&dpm_list_mtx);
if (dev->parent && dev->parent->power.is_prepared)
dev_warn(dev, "parent %s should not be sleeping\n",
dev_name(dev->parent));
list_add_tail(&dev->power.entry, &dpm_list);
mutex_unlock(&dpm_list_mtx);
}
编译kernel、烧机、抓开机log :
[ 21.493949]<2>.(2)[1:swapper/0][KE]:[0xd9068000]
[ 21.494539]Backtrace:
[ 21.494883]<2>.(2)[1:swapper/0][<c0013a48>] (dump_backtrace+0x0/0x110) from [<c0984660>] (dump_stack+0x18/0x1c)
[ 21.496161] r6:d9068000 r5:d90680b0 r4:d9068000 r3:dbc3d000
[ 21.496891]<2>.(2)[1:swapper/0][<c0984648>] (dump_stack+0x0/0x1c) from [<c032713c>] (device_pm_add+0x118/0x120)
[ 21.498191]<2>.(2)[1:swapper/0][<c0327024>] (device_pm_add+0x0/0x120) from [<c031c958>] (device_add+0x4e0/0x5d0)
[ 21.499470] r5:c0df60c8 r4:d9068008
[ 21.499945]<2>.(2)[1:swapper/0][<c031c478>] (device_add+0x0/0x5d0) from [<c031d278>] (device_create_vargs+0xa8/0xc0)
[ 21.501304]<2>.(2)[1:swapper/0][<c031d1d0>] (device_create_vargs+0x0/0xc0) from [<c031d2b8>] (device_create+0x28/0x30)
[ 21.502648] r9:00000000 r8:c14698c0 r7:dbcd72a8 r6:c165b144 r5:d9046a00
r4:dbcd7280
[ 21.503754]<2>.(2)[1:swapper/0][<c031d290>] (device_create+0x0/0x30) from [<c0678bb0>] (spidev_probe+0x2fc/0x528)
[ 21.505076]<2>.(2)[1:swapper/0][<c06788b4>] (spidev_probe+0x0/0x528) from [<c06954f0>] (spi_drv_probe+0x20/0x24)
[ 21.505076]<2>.(2)[1:swapper/0][<c06788b4>] (spidev_probe+0x0/0x528) from [<c06954f0>] (spi_drv_probe+0x20/0x24)
结合addr2Line工具可以解析出backtrace对于的函数所在文件行号,然后直接去源代码搜索相关函数就可以了.
到这一步已经大大缩小了异常的范围到一个设备驱动代码上了,
已完成了从汇编代码到源代码转换的过程,此题已解70%,
接下来就需要分析
silead_fp_dev
代码 logic了,找出具体异常原因和解决问题了,剩下内容
待下一章分析
^_^ 。