Android Watchdog介绍和实例分析

1. Watchdog是什么?

Android Watchdog主要是检测system server重要线程状态(main,android io,android display ,Activitymanager,android fg,android bg等线程 )的一种机制。发生Watchdog时会触发Android重启来使系统恢复。Watchdog是SystemServer的一个线程(class Watchdog extends Thread),检测system server重要线程的锁状态和消息是否阻塞,假如有线程block了60s那么就会触发watchdog timeout flow。

2. Watchdog代码基本介绍

可以在http://androidxref.com/搜索源码,看Watchdog.java文件

public class Watchdog extends Thread {
    final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();
    ...

    private Watchdog() {
        super("watchdog");
        //将前台线程加入队列
        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);
        //将主线程加入队列
        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                "main thread", DEFAULT_TIMEOUT));
        //将ui线程加入队列
        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                "ui thread", DEFAULT_TIMEOUT));
        //将i/o线程加入队列
        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                "i/o thread", DEFAULT_TIMEOUT));
        //将display线程加入队列
        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                "display thread", DEFAULT_TIMEOUT));
        addMonitor(new BinderThreadMonitor());
    }

}

ActivityManagerService添加

@frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java
3184        Watchdog.getInstance().addMonitor(this);
3185        Watchdog.getInstance().addThread(mHandler);

/** In this method we try to acquire our lock to make sure that we have not deadlocked */
public void monitor() {
    synchronized (this) { }
}

Watchdog线程循环检测

    @Override
    public void run() {
        boolean waitedHalf = false;
        while (true) {
            ...
            //java trace
            final File stack = ActivityManagerService.dumpStackTraces(
                    !waitedHalf, pids, null, null, getInterestingNativePids());


            //sysrq
            // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
            doSysRq('w');
            doSysRq('l');

            //dropbox
            Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                    public void run() {
                        mActivity.addErrorToDropBox(
                                "watchdog", null, "system_server", null, null,
                                subject, null, stack, null);
                    }
                };
            dropboxThread.start();
            try {
                dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
            } catch (InterruptedException ignored) {}

            ...
            //打印LOG WATCHDOG KILLING SYSTEM PROCESS
            } else {
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
                Slog.w(TAG, "*** GOODBYE!");
                // kill自己(SystemServer)重启
                Process.killProcess(Process.myPid());  
                System.exit(10);
            }

            waitedHalf = false;
        }
    }
private static final class BinderThreadMonitor implements Watchdog.Monitor {
    @Override
    public void monitor() {
        Binder.blockUntilThreadAvailable();
    }
}

BinderThread耗尽,就会Blocked在这里

void IPCThreadState::blockUntilThreadAvailable()
{
    pthread_mutex_lock(&mProcess->mThreadCountLock);
    while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {
        ALOGW("Waiting for thread to be free. mExecutingThreadsCount=%lu mMaxThreads=%lu\n",
                static_cast<unsigned long>(mProcess->mExecutingThreadsCount),
                static_cast<unsigned long>(mProcess->mMaxThreads));
        pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);
    }
    pthread_mutex_unlock(&mProcess->mThreadCountLock);
}

3. 原因分析

  • (1)手机性能影响主要一下:

cpu使用高 –查看cpu使用,看是否有问题

low memory –通过kernel log搜索lowmemorykiller看cache和free的内存是否够,是否有memory泄露

emmc问题(io文件)–block的trace应该在io操作时,可以kernellog搜索是否有emmc错误

dump log –在发生anr之前系统在dump log导致花费时间

object leakage –比如fd leak,window leakage,service leakage,receiver leakage,window leakage

  • (2)非性能影响,可能原因如下,看backtrace来分析

4. 实例分析

例1:SystemServer Binder耗尽
10-26 00:07:00.884 1000 17132 17312 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.Watchdog$BinderThreadMonitor on foreground thread (android.fg)
10-26 00:07:00.884 1000 17132 17312 W Watchdog: foreground thread stack trace:
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at android.os.Binder.blockUntilThreadAvailable(Native Method)
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at com.android.server.Watchdog$BinderThreadMonitor.monitor(Watchdog.java:381)
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:353)
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at android.os.Handler.handleCallback(Handler.java:873)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.Handler.dispatchMessage(Handler.java:99)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.Looper.loop(Looper.java:193)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.HandlerThread.run(HandlerThread.java:65)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at com.android.server.ServiceThread.run(ServiceThread.java:44)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: *** GOODBYE!
例2:IO阻塞
W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.wm.WindowManagerService on foreground thread (android.fg)

Watchdog告诉我们Monitor Checker超时了,具体在哪呢? 名为android.fg的线程在WindowManagerService的monitor()方法被阻塞了。这里隐含了两层意思:

WindowManagerService实现了Watchdog.Monitor这个接口,并将自己作为Monitor Checker的对象加入到了Watchdog的监测集中

monitor()方法是运行在android.fg线程中的。Android将android.fg设计为一个全局共享的线程,意味着它的消息队列可以被其他线程共享, Watchdog的Monitor Checker就是使用的android.fg线程的消息队列。因此,出现Monitor Checker的超时,肯定是android.fg线程阻塞在monitor()方法上。

我们打开system_server进程的traces,检索 android.fg 可以快速定位到该线程的函数调用栈:

"android.fg" prio=5 tid=25 Blocked
  | group="main" sCount=1 dsCount=0 obj=0x12eef900 self=0x7f7a8b1000
  | sysTid=973 nice=0 cgrp=default sched=0/0 handle=0x7f644e9000
  | state=S schedstat=( 3181688530 2206454929 8991 ) utm=251 stm=67 core=1 HZ=100
  | stack=0x7f643e7000-0x7f643e9000 stackSize=1036KB
  | held mutexes=
  at com.android.server.wm.WindowManagerService.monitor(WindowManagerService.java:13125)
  - waiting to lock <0x126dccb8> (a java.util.HashMap) held by thread 91
  at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:204)
  at android.os.Handler.handleCallback(Handler.java:815)
  at android.os.Handler.dispatchMessage(Handler.java:104)
  at android.os.Looper.loop(Looper.java:194)
  at android.os.HandlerThread.run(HandlerThread.java:61)
  at com.android.server.ServiceThread.run(ServiceThread.java:46)

android.fg线程调用栈告诉我们几个关键的信息:

-这个线程当前的状态是Blocked,阻塞
-由Watchdog发起调用monitor(),这是一个Watchdog检查,阻塞已经超时
-waiting to lock <0x126dccb8>: 阻塞的原因是monitor()方法中在等锁<0x126dccb8>
-held by thread 91: 这个锁被编号为91的线程持有,需要进一步观察91号线程的状态。

可以在traces.txt文件中检索 tid=91 来快速找到91号线程的函数调用栈信息:

"Binder_C" prio=5 tid=91 Native
  | group="main" sCount=1 dsCount=0 obj=0x12e540a0 self=0x7f63289000
  | sysTid=1736 nice=0 cgrp=default sched=0/0 handle=0x7f6127c000
  | state=S schedstat=( 96931835222 49673449591 260122 ) utm=7046 stm=2647 core=2 HZ=100
  | stack=0x7f5ffbc000-0x7f5ffbe000 stackSize=1008KB
  | held mutexes=
  at libcore.io.Posix.writeBytes(Native method)
  at libcore.io.Posix.write(Posix.java:258)
  at libcore.io.BlockGuardOs.write(BlockGuardOs.java:313)
  at libcore.io.IoBridge.write(IoBridge.java:537)
  at java.io.FileOutputStream.write(FileOutputStream.java:186)
  at com.android.internal.util.FastPrintWriter.flushBytesLocked(FastPrintWriter.java:334)
  at com.android.internal.util.FastPrintWriter.flushLocked(FastPrintWriter.java:355)
  at com.android.internal.util.FastPrintWriter.appendLocked(FastPrintWriter.java:303)
  at com.android.internal.util.FastPrintWriter.print(FastPrintWriter.java:466)
  - locked <@addr=0x134c4910> (a com.android.internal.util.FastPrintWriter$DummyWriter)
  at com.android.server.wm.WindowState.dump(WindowState.java:1510)
  at com.android.server.wm.WindowManagerService.dumpWindowsNoHeaderLocked(WindowManagerService.java:12279)
  at com.android.server.wm.WindowManagerService.dumpWindowsLocked(WindowManagerService.java:12266)
  at com.android.server.wm.WindowManagerService.dump(WindowManagerService.java:12654)
  - locked <0x126dccb8> (a java.util.HashMap)
  at android.os.Binder.dump(Binder.java:324)
  at android.os.Binder.onTransact(Binder.java:290)

91号线程的名字是Binder_C,它的函数调用栈告诉我们几个关键信息:

  • Native,表示线程处于运行状态(RUNNING),并且正在执行JNI方法
  • 在WindowManagerService.dump()方法申请了锁<0x126dccb8>,这个锁正是android.fg线程所等待的
  • FileOutputStream.write()表示Binder_C线程在执行IO写操作,正式因为这个写操作一直在阻塞,导致线程持有的锁不能释放

聪明的你看到这或许已经能够想到解决办法了,在这个IO写操作上加一个超时机制,并且这个超时小于Watchdog的超时,不就可以让线程释放它所占有的锁了吗? 是的,这确实可以作为一个临时解决方案(Workaround),或者说一个保护机制。但我们可以再往深处想一想,这个IO写操作为什么会阻塞:

  • 是不是IO缓冲区满了,导致写阻塞呢?
  • 是不是写操作有什么锁,导致这个write方法在等锁呢?
  • 是不是当前系统的IO负载过于高,导致写操作效率很低呢?

这都需要我们再进一步从日志中去找原因。如果已有的日志不全,找不到论据,我们还需要设计场景来验证假设,解决问题的难度陡然上升。

例3:Thread.sleep阻塞
//watchdog half
08-17 02:59:46.824   906  1384 I Watchdog: WAITED_HALF

//dumpStackTraces half watchdog(data/anr)
08-17 02:59:46.949   906  1384 I ActivityManager: dumpStackTraces pids=null nativepids=[512, 658, 643, 819, 645, 646, 663, 828, 653, 510, 481, 508, 672, 797, 798, 806, 810, 811, 812, 815, 819, 828, 1504]
08-17 02:59:46.959   906  1384 I ActivityManager: Dumping to /data/anr/anr_2020-08-17-02-59-46-955

//发生watchdog
08-17 03:00:21.784   906  1384 I watchdog: Blocked in monitor com.android.server.am.ActivityManagerService on foreground thread (android.fg)

dumpStackTraces watchdog(data/anr)
08-17 03:00:21.876   906  1384 I ActivityManager: dumpStackTraces pids=null nativepids=[512, 658, 643, 819, 645, 646, 663, 828, 653, 510, 481, 508, 672, 797, 798, 806, 810, 811, 812, 815, 819, 828, 1504]
08-17 03:00:21.880   906  1384 I ActivityManager: Dumping to /data/anr/anr_2020-08-17-03-00-21-878

//dump sysrq(/proc/sysrq-trigger)
08-17 03:00:31.834  1384  1384 I SysRq   : Show Blocked State
08-17 03:00:31.843  1384  1384 I SysRq   : Show backtrace of all active CPUs

//dropbox (/data/system/dropbox/system_server_watchdog@xxx.txt)
08-17 03:00:31.902   906  3367 I DropBoxManagerService: add tag=system_server_watchdog isTagEnabled=true flags=0x2

@main/system log
08-17 03:00:31.966   906  1384 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.am.ActivityManagerService on foreground thread (android.fg)

08-17 03:00:31.968   906  1384 W Watchdog: android.fg annotated stack trace:
08-17 03:00:31.968   906  1384 W Watchdog:     at com.android.server.am.ActivityManagerService.monitor(ActivityManagerService.java:17574)
08-17 03:00:31.968   906  1384 W Watchdog:     - waiting to lock <0x05fe6da3> (a java.lang.Object)
08-17 03:00:31.968   906  1384 W Watchdog:     at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:233)
08-17 03:00:31.968   906  1384 W Watchdog:     at android.os.Handler.handleCallback(Handler.java:883)
08-17 03:00:31.969   906  1384 W Watchdog:     at android.os.Handler.dispatchMessage(Handler.java:100)
08-17 03:00:31.969   906  1384 W Watchdog:     at android.os.Looper.loop(Looper.java:214)
08-17 03:00:31.969   906  1384 W Watchdog:     at android.os.HandlerThread.run(HandlerThread.java:67)
08-17 03:00:31.969   906  1384 W Watchdog:     at com.android.server.ServiceThread.run(ServiceThread.java:44)
08-17 03:00:31.969   906  1384 W Watchdog: *** GOODBYE!

//kill self
08-17 03:00:31.969   906  1384 I Process : Sending signal. PID: 906 SIG: 9

//重启LOG
08-17 03:00:33.079  3407  3407 I boot_progress_start: 184709
08-17 03:00:33.635  3407  3407 I boot_progress_preload_start: 185266
08-17 03:00:34.376  3407  3407 I boot_progress_preload_end: 186006
08-17 03:00:34.682  3481  3481 I boot_progress_system_run: 186312
08-17 03:00:35.026  3481  3481 I boot_progress_pms_start: 186656
08-17 03:00:35.227  3481  3481 I boot_progress_pms_system_scan_start: 186857
08-17 03:00:36.049  3481  3481 I boot_progress_pms_data_scan_start: 187679
08-17 03:00:36.056  3481  3481 I boot_progress_pms_scan_end: 187687
08-17 03:00:36.165  3481  3481 I boot_progress_pms_ready: 187795
08-17 03:00:37.424  3481  3481 I boot_progress_ams_ready: 189054
08-17 03:00:38.361  3481  3503 I boot_progress_enable_screen: 189992

//watchdog half 和 watchdog trace表明是"Binder:906_F"线程sleep导致锁无法释放

//watchdog trace
----- pid 906 at 2020-08-17 03:00:21 -----
Cmd line: system_server

"android.fg" prio=5 tid=15 Blocked
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x147812e8 self=0x7a0bb9cc00
  | sysTid=1378 nice=0 cgrp=default sched=0/0 handle=0x791e0cbd50
  | state=S schedstat=( 293867703 541217342 863 ) utm=23 stm=5 core=1 HZ=100
  | stack=0x791dfc9000-0x791dfcb000 stackSize=1039KB
  | held mutexes=
  at com.android.server.am.ActivityManagerService.monitor(ActivityManagerService.java:17574)
  - waiting to lock <0x05fe6da3> (a java.lang.Object) held by thread 117
  at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:233)

//"Binder:906_F"线程(906:2913),sleep导致了Blocked
"Binder:906_F" prio=5 tid=117 Sleeping
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x13900728 self=0x797b178000
  | sysTid=2913 nice=0 cgrp=default sched=0/0 handle=0x78c81e9d50
  | state=S schedstat=( 227062849 166952101 637 ) utm=17 stm=4 core=2 HZ=100
  | stack=0x78c80f3000-0x78c80f5000 stackSize=991KB
  | held mutexes=
  at java.lang.Thread.sleep(Native method)
  - sleeping on <0x03e0cbb8> (a java.lang.Object)
  at java.lang.Thread.sleep(Thread.java:440)
  - locked <0x03e0cbb8> (a java.lang.Object)
  at java.lang.Thread.sleep(Thread.java:356)
  at com.android.server.am.ActivityManagerService.testWatchdog(ActivityManagerService.java:2916)
  - locked <0x05fe6da3> (a java.lang.Object)
  at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:4506)
  at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:2744)
  at android.os.Binder.execTransactInternal(Binder.java:1021)
  at android.os.Binder.execTransact(Binder.java:994)
//Binder Client端是TWatchdogTest(pid:3116 tid:3146)
08-17 03:00:32.208  3116  3146 W System.err: android.os.DeadObjectException
08-17 03:00:32.207   301   301 I binder  : release 906:2913 transaction 53556 in, still active
08-17 03:00:32.207   301   301 I binder  : send failed reply for transaction 53556 to 3116:3146
08-17 03:00:32.208  3116  3146 W System.err: 	at android.os.BinderProxy.transactNative(Native Method)
08-17 03:00:32.208  3116  3146 W System.err: 	at android.os.BinderProxy.transact(BinderProxy.java:510)
08-17 03:00:32.208  3116  3146 W System.err: 	at android.app.IActivityManager$Stub$Proxy.testWatchdog(IActivityManager.java:9331)
08-17 03:00:32.208  3116  3146 W System.err: 	at com.example.serviceclient.MainActivity$5.run(MainActivity.java:113)
08-17 03:00:32.208  3116  3146 W System.err: 	at java.lang.Thread.run(Thread.java:919)

08-17 02:58:58.155  3116  3146 D TWatchdogTest: startThread Client .

代码

明显是testWatchdog()sleep太久,导致mobj无法释放,超过60s,最后触发Watchdog
@ActivityManagerService.java

private Object mobj = new Object();

@Override
public void testWatchdog() {
    synchronized (mobj) {
        Slog.d(TAG, "testWatchdog .");
        try {
            Thread.sleep(1000*1000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        Slog.d(TAG, "startThread .");
    }
}

//AMS monitor添加mobj锁检测
/** In this method we try to acquire our lock to make sure that we have not deadlocked */
public void monitor() {
    synchronized (this) { }
    synchronized (mobj) {
        Slog.d(TAG, "testWatchdog monitor() .");
    }
}
@core/java/android/app/IActivityManager.aidl
void testWatchdog();
//MainActivity
public void startThread(String name) {
    Thread mThread = new Thread(new Runnable() {
        @Override
        public void run() {
            Log.d(TAG, "startThread Client .");
            try {
                ActivityManager.getService().testWatchdog();
            } catch (RemoteException e) {
                e.printStackTrace();
            }
        }
    });
    mThread.setName(name);
    mThread.start();
}
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值