1. Symptoms
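- The problem occurred on Android 7.0 (Nougat);
- Neither swipes nor key presses got any response, and the screen content did not refresh at all;
- The watchdog did not restart system_server;
- adb could still connect to the device in this state.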
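2. Preliminary analysis
For a system hang, some preparation is needed before the analysis can start:
(1) Get hold of the device in its failing state and put it on charge promptly, so that the state is not lost;
(2) With a live device (skip this step otherwise), run kill -3 followed by the system_server pid to generate a fresh traces file;
(3) If a fresh traces file cannot be generated, run debuggerd -b followed by the system_server pid to dump the native call stacks of all threads to a file;
(4) Use adb to pull everything under /data/anr;
(5) Use adb to pull everything under /data/tombstones.
On this device, kill -3 did not produce a traces file with a current timestamp, so the only option was the most recent traces file under /data/anr. Its timestamp, however, was from the previous day: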
----- pid 1487 at 2017-04-25 22:44:52 -----
Cmd line: system_server
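Moreover, in that traces file from the previous day, every system_server thread was in a normal state, with no obvious problems or blocking.
Next we analyze the native call stacks printed by debuggerd -b for system_server. First we check the current state of the watchdog thread, to see why it did not restart the phone: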
"watchdog" sysTid=1877
#00 pc 000000000001bf6c /system/lib64/libc.so (syscall+28)
#01 pc 00000000000e7ac8 /system/lib64/libart.so (_ZN3art17ConditionVariable16WaitHoldingLocksEPNS_6ThreadE+160)
#02 pc 000000000037ac68 /system/lib64/libart.so (_ZN3art7Monitor4WaitEPNS_6ThreadElibNS_11ThreadStateE+896)
#03 pc 000000000054e980 /system/framework/arm64/boot.oat (offset 0x54e000) (java.lang.Object.wait+140)
#04 pc 000000000054e8b8 /system/framework/arm64/boot.oat (offset 0x54e000) (java.lang.Object.wait+52)
#05 pc 00000000011035a8 /system/framework/oat/arm64/services.odex (offset 0xf0c000)
The watchdog thread is waiting in the WaitHoldingLocks method of ConditionVariable. Why is it waiting here, and is that normal?
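The method names in frames #01 and #02 appear as mangled C++ symbols. As a small aid for reading such stacks, the sketch below decodes them with abi::__cxa_demangle from <cxxabi.h> (available with GCC and Clang). The helper name Demangle is hypothetical and is not part of the original analysis.

```cpp
#include <cxxabi.h>

#include <cstdlib>
#include <string>

// Hypothetical helper: returns the demangled form of an Itanium-ABI symbol,
// or the input unchanged when it cannot be demangled.
std::string Demangle(const std::string& mangled) {
  int status = 0;
  // With a null output buffer, __cxa_demangle allocates the result with malloc.
  char* out = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
  if (status != 0 || out == nullptr) {
    return mangled;
  }
  std::string result(out);
  std::free(out);  // The caller owns the malloc'ed buffer.
  return result;
}
```

Passing the symbol from frame #01, _ZN3art17ConditionVariable16WaitHoldingLocksEPNS_6ThreadE, yields art::ConditionVariable::WaitHoldingLocks(art::Thread*).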
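With these questions in mind, we use the addresses in the call stack together with the addr2line tool to locate the code layer by layer. First, Object's wait method calls Monitor's Wait method: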
/* art/runtime/monitor.cc */
void Monitor::Wait(Thread* self, int64_t ms, int32_t ns,
                   bool interruptShouldThrow, ThreadState why) {
...

  bool was_interrupted = false;
  {
    // Update thread state. If the GC wakes up, it'll ignore us, knowing
    // that we won't touch any references in this state, and we'll check
    // our suspend mode before we transition out.
    ScopedThreadSuspension sts(self, why);
...

    // Handle the case where the thread was interrupted before we called wait().
    if (self->IsInterruptedLocked()) {
      was_interrupted = true;
    } else {
      // Wait for a notification or a timeout to occur.
      if (why == kWaiting) {
        self->GetWaitConditionVariable()->Wait(self);
      } else {
        DCHECK(why == kTimedWaiting || why == kSleeping) << why;
        self->GetWaitConditionVariable()->TimedWait(self, ms, ns);
      }
      was_interrupted = self->IsInterruptedLocked();
    }
  }
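The branch on why above chooses between an untimed wait and a timed wait. The following is only an analogy built on std::condition_variable from the C++ standard library, not ART's own ConditionVariable; the names Waiter, WaitKind and Notify are hypothetical. It shows the property that matters for this analysis: a timed wait is expected to return by itself once the timeout expires, even if nobody notifies.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the two wait modes selected by `why`.
enum class WaitKind { kWaiting, kTimedWaiting };

struct Waiter {
  std::mutex mu;
  std::condition_variable cv;
  bool notified = false;

  // Returns true if a notification was observed, false if the timed wait
  // ran out first. The untimed wait blocks until Notify() has been called.
  bool Wait(WaitKind why, std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mu);
    if (why == WaitKind::kWaiting) {
      cv.wait(lock, [this] { return notified; });
      return true;
    }
    return cv.wait_for(lock, timeout, [this] { return notified; });
  }

  void Notify() {
    {
      std::lock_guard<std::mutex> lock(mu);
      notified = true;
    }
    cv.notify_all();
  }
};
```

With no notifier, a call such as Wait(WaitKind::kTimedWaiting, std::chrono::milliseconds(10)) returns false after roughly the timeout, whereas the untimed mode would block indefinitely.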
Inside Monitor::Wait, before self->GetWaitConditionVariable()->Wait or TimedWait is called, the constructor of ScopedThreadSuspension switches the thread state from Runnable to Suspended. The transition code is as follows:
/* art/runtime/scoped_thread_state_change.h */
// Annotalysis helper for going to a suspended state from runnable.
class ScopedThreadSuspension : public ValueObject {
 public:
  explicit ScopedThreadSuspension(Thread* self, ThreadState suspended_state)
...
  {
    DCHECK(self_ != nullptr);
    self_->TransitionFromRunnableToSuspended(suspended_state);
  }
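ScopedThreadSuspension follows the usual RAII shape: constructing the object performs one state transition, and leaving the enclosing scope reverses it. The standalone sketch below illustrates only that shape. FakeThread, ScopedSuspension and StateInsideScope are hypothetical names, and the plain assignments stand in for ART's real transition functions, which do considerably more work.

```cpp
// Hypothetical, simplified thread states for illustration only.
enum class ThreadState { kRunnable, kWaiting, kTimedWaiting };

// Hypothetical stand-in for art::Thread that only tracks a state.
struct FakeThread {
  ThreadState state = ThreadState::kRunnable;
};

class ScopedSuspension {
 public:
  // Constructor: Runnable -> the requested suspended state.
  ScopedSuspension(FakeThread* self, ThreadState suspended_state) : self_(self) {
    self_->state = suspended_state;
  }
  // Destructor: back to Runnable when the enclosing scope ends.
  ~ScopedSuspension() { self_->state = ThreadState::kRunnable; }

  ScopedSuspension(const ScopedSuspension&) = delete;
  ScopedSuspension& operator=(const ScopedSuspension&) = delete;

 private:
  FakeThread* const self_;
};

// Reports the state observed while the scoped object is alive.
ThreadState StateInsideScope(FakeThread* self) {
  ScopedSuspension sts(self, ThreadState::kTimedWaiting);
  return self->state;  // Read before sts is destroyed.
}
```

Calling StateInsideScope on a fresh FakeThread returns kTimedWaiting, and the thread's state is kRunnable again once the call has returned.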
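Once self->GetWaitConditionVariable()->Wait or TimedWait returns, that is, once the awaited condition is met or the timeout expires, execution continues. When it leaves the scope of the block that contains the ScopedThreadSuspension object sts, the destructor of ScopedThreadSuspension runs and switches the thread state again, this time from Suspended back to Runnable, through Thread::TransitionFromSuspendedToRunnable. The transition code is as follows: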
/* art/runtime/thread-inl.h */
inline ThreadState Thread::TransitionFromSuspendedToRunnable() {
...
  do {
...
    } else if ((old_state_and_flags.as_struct.flags & kActiveSuspendBarrier) != 0) {
      PassActiveSuspendBarriers(this);
    }