Application and System Stability, Part 5: Watchdog Internals and Problem Analysis

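This series already has four posts. Among freeze and reboot problems, Watchdog issues are the most common, so this post covers how Watchdog works and the usual routine for analyzing Watchdog problems.
Application and System Stability, Part 1: A General Routine for Analyzing ANR Problems
Application and System Stability, Part 2: ANR Detection and Information Collection
Application and System Stability, Part 3: Notes on FD Leaks
Application and System Stability, Part 4: Analyzing a Null Pointer Problem Caused by a Single Thread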

I. Watchdog basics
1. What is Watchdog?

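Watchdog is, as the name says, a guard dog: if nobody "feeds" it on time and more than one minute goes by, it bites. Android has hundreds of system services. To keep the screen from freezing when one of the core services in SystemServer hangs, Android introduced the Watchdog mechanism. When a fault is detected, Watchdog calls Process.killProcess(Process.myPid()) to kill the SystemServer process. system_server is zygote's eldest disciple, the first process zygote forks. Together, zygote and system_server hold up half of the Java world, and if either one dies the Java world collapses. So when the child process SystemServer dies, Zygote kills itself as well, and every process Zygote spawned gets restarted. The effect is a soft reboot of the phone, so the user is not left with a frozen, unusable device.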
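To make the "feed the dog or it bites" idea concrete, here is a minimal sketch of a generic watchdog timer. It is not how Watchdog.java is written; the class name FeedDogSketch and its methods are made up for illustration, and timestamps are passed in explicitly so the behavior is easy to reason about.

```java
/** Hypothetical generic watchdog timer, for illustration only (not AOSP code). */
final class FeedDogSketch {
    private final long timeoutMs;   // how long the dog waits before biting
    private long lastFedMs;         // timestamp of the most recent feed

    FeedDogSketch(long timeoutMs, long nowMs) {
        this.timeoutMs = timeoutMs;
        this.lastFedMs = nowMs;
    }

    /** The monitored code calls this regularly to prove it is still alive. */
    void feed(long nowMs) {
        lastFedMs = nowMs;
    }

    /** True once more than timeoutMs has passed since the last feed. */
    boolean shouldBite(long nowMs) {
        return nowMs - lastFedMs > timeoutMs;
    }

    public static void main(String[] args) {
        FeedDogSketch dog = new FeedDogSketch(60_000, 0);
        dog.feed(30_000);
        System.out.println(dog.shouldBite(80_000));   // fed 50 s ago
        System.out.println(dog.shouldBite(100_000));  // fed 70 s ago
    }
}
```

With a one-minute timeout, a dog fed 50 seconds ago stays quiet and a dog fed 70 seconds ago bites. The real Watchdog does not wait to be fed; it actively probes threads and locks, which is what the rest of this post is about.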

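That is how the system copes with a Watchdog event. What we as developers care about is where exactly the Watchdog fired. As with ANR, traces are dumped while a Watchdog event is being handled, and those traces are what we use to locate and fix the problem. So we need to understand the mechanism that decides a timeout has occurred.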

The Watchdog code lives in /frameworks/base/services/core/java/com/android/server/Watchdog.java

Two kinds of log are common: "Blocked in handler" and "Blocked in monitor". The difference between them is analyzed later in this post.

11-15 06:56:39.696 24203 24902 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in handler on main thread (main), Blocked in handler on ui thread (android.ui)
11-15 06:56:39.696 24203 24902 W Watchdog: main thread stack trace:
11-15 06:56:39.696 24203 24902 W Watchdog:     at android.os.MessageQueue.nativePollOnce(Native Method)
11-15 06:56:39.696 24203 24902 W Watchdog:     at android.os.MessageQueue.next(MessageQueue.java:323)
11-15 06:56:39.696 24203 24902 W Watchdog:     at android.os.Looper.loop(Looper.java:142)
11-15 06:56:39.696 24203 24902 W Watchdog:     at com.android.server.SystemServer.run(SystemServer.java:377)
11-15 06:56:39.696 24203 24902 W Watchdog:     at com.android.server.SystemServer.main(SystemServer.java:239)
11-15 06:56:39.696 24203 24902 W Watchdog:     at java.lang.reflect.Method.invoke(Native Method)
11-15 06:56:39.696 24203 24902 W Watchdog:     at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:901)
11-15 06:56:39.696 24203 24902 W Watchdog:     at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:791)
11-15 06:56:39.696 24203 24902 W Watchdog: ui thread stack trace:
......
10-26 00:07:00.884 1000 17132 17312 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.Watchdog$BinderThreadMonitor on foreground thread (android.fg)
10-26 00:07:00.884 1000 17132 17312 W Watchdog: foreground thread stack trace:
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at android.os.Binder.blockUntilThreadAvailable(Native Method)
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at com.android.server.Watchdog$BinderThreadMonitor.monitor(Watchdog.java:381)
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:353)
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at android.os.Handler.handleCallback(Handler.java:873)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.Handler.dispatchMessage(Handler.java:99)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.Looper.loop(Looper.java:193)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.HandlerThread.run(HandlerThread.java:65)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at com.android.server.ServiceThread.run(ServiceThread.java:44)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: *** GOODBYE!
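When triaging many logs it helps to pull the blocked checkers out of the "WATCHDOG KILLING SYSTEM PROCESS" line automatically. The following is a rough sketch that assumes the line format shown in the two samples above; WatchdogLogParser and Blocked are hypothetical helper names, not part of AOSP, and other Android versions may format the line differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical triage helper (not AOSP code): splits a Watchdog kill line into entries. */
final class WatchdogLogParser {
    /** One blocked checker taken from the kill line. */
    static final class Blocked {
        final String kind;     // "handler" or "monitor"
        final String monitor;  // monitor class name, or null for a handler block
        final String checker;  // checker name, e.g. "main thread"
        final String thread;   // thread name, e.g. "main"

        Blocked(String kind, String monitor, String checker, String thread) {
            this.kind = kind;
            this.monitor = monitor;
            this.checker = checker;
            this.thread = thread;
        }
    }

    private static final String MARKER = "WATCHDOG KILLING SYSTEM PROCESS: ";

    // Assumed entry shape: "Blocked in handler on <checker> (<thread>)" or
    // "Blocked in monitor <class> on <checker> (<thread>)".
    private static final Pattern ENTRY = Pattern.compile(
            "Blocked in (handler|monitor (\\S+)) on (.+?) \\(([^)]+)\\)");

    /** Returns every blocked checker on the line, or an empty list if it is not a kill line. */
    static List<Blocked> parse(String logLine) {
        List<Blocked> result = new ArrayList<>();
        int idx = logLine.indexOf(MARKER);
        if (idx < 0) {
            return result;
        }
        Matcher m = ENTRY.matcher(logLine.substring(idx + MARKER.length()));
        while (m.find()) {
            String monitorClass = m.group(2);  // null when the entry is a handler block
            String kind = (monitorClass == null) ? "handler" : "monitor";
            result.add(new Blocked(kind, monitorClass, m.group(3), m.group(4)));
        }
        return result;
    }
}
```

Fed the first sample line, parse returns two handler entries (main and android.ui); fed the second, it returns one monitor entry whose class is com.android.server.Watchdog$BinderThreadMonitor on android.fg.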
2. Initialization
[Figure: Watchdog initialization]

Watchdog itself extends Thread, and it is initialized while SystemServer starts up.

public final class SystemServer {
  ... ...
    /**
     * Starts a miscellaneous grab bag of stuff that has yet to be refactored
     * and organized.
     */
    private void startOtherServices() {
    ......
        try {
          ......
            traceBeginAndSlog("InitWatchdog");
            final Watchdog watchdog = Watchdog.getInstance(); // get the Watchdog singleton (created on first call)
            watchdog.init(context, mActivityManagerService); // register a receiver for the system reboot broadcast
            Trace.traceEnd(Trace.TRACE_TAG_SYSTEM_SERVER);
          ......
        }
         ......
        mActivityManagerService.systemReady(new Runnable() {
            @Override
            public void run() {
              ......
                Watchdog.getInstance().start();
              ......
             }
        });
    }

    public static Watchdog getInstance() {
        if (sWatchdog == null) {
            sWatchdog = new Watchdog();
        }

        return sWatchdog;
    }

To build a timeout-detection scheme, the Watchdog constructor creates a number of HandlerChecker objects. They fall into two categories:

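  • Monitor Checker: checks for deadlocks that may occur on Monitor objects. Core system services such as AMS, PKMS and WMS are all Monitor objects.
  • Looper Checker: checks whether a thread's message queue has been busy for too long. The queue Watchdog itself uses, plus the global ui, I/O and display queues, are all checked. The message queues of some other important threads, for example those of AMS and PKMS, are also added to the Looper Checker; this happens when the corresponding object is initialized.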
    /* This handler will be used to post message back onto the main thread */
    final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();

    private Watchdog() {
        // This actually calls the parent Thread constructor to set the thread name
        super("watchdog");
        // Initialize handler checkers for each common thread we want to check.  Note
        // that we are not currently checking the background thread, since it can
        // potentially hold longer running operations with no guarantees about the timeliness
        // of operations there.

        // The shared foreground thread is the main checker.  It is where we
        // will also dispatch monitor checks and do other work.
        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);
        // Add checker for main thread.  We only do a quick check since there
        // can be UI running on the thread.
        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                "main thread", DEFAULT_TIMEOUT));
        // Add checker for shared UI thread.
        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                "ui thread", DEFAULT_TIMEOUT));
        // And also check IO thread.
        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                "i/o thread", DEFAULT_TIMEOUT));
        // And the display thread.
        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                "display thread", DEFAULT_TIMEOUT));

        // Initialize monitor for Binder threads.
        addMonitor(new BinderThreadMonitor());
        // New in Android O: monitoring for FD leaks
        mOpenFdMonitor = OpenFdMonitor.create();
......
    }

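DEFAULT_TIMEOUT is normally one minute; for installd it is 10 minutes.
The two kinds of HandlerChecker have different emphases:

The Monitor Checker warns us not to hold the object lock of a core system service for a long time, because that blocks many other functions from running.
The Looper Checker warns us not to hog a message queue for a long time, because the other messages would never get handled.

So these two Checkers are what Watchdog relies on to do its job.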
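The two failure modes can be reproduced in plain Java. The sketch below is a simplified stand-in, not the real HandlerChecker: a single-thread ExecutorService plays the role of a Handler on a Looper thread, CheckerSketch and demo are made-up names, and the check is appended to the end of the queue, which is a simplification of my own. A check that never gets to run means the queue is hogged ("Blocked in handler"); a check that starts but gets stuck taking a service lock means the lock is held too long ("Blocked in monitor").

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Simplified stand-in for a handler checker; plain Java, no Android classes. */
final class CheckerSketch implements Runnable {
    /** Same shape as Watchdog.Monitor: an implementation takes and releases its lock. */
    interface Monitor { void monitor(); }

    private final ExecutorService queue;  // assumption: stands in for a Handler on a Looper thread
    private final List<Monitor> monitors = new ArrayList<>();
    private volatile boolean completed = true;
    private volatile boolean inMonitor = false;

    CheckerSketch(ExecutorService queue) { this.queue = queue; }

    void addMonitor(Monitor m) { monitors.add(m); }

    /** Posts the check; it runs only after the work already in the queue has finished. */
    void scheduleCheck() {
        completed = false;
        queue.execute(this);
    }

    @Override
    public void run() {
        for (Monitor m : monitors) {
            inMonitor = true;
            m.monitor();  // blocks while another thread holds the service lock
        }
        inMonitor = false;
        completed = true;
    }

    /** Returns "ok", "Blocked in handler" or "Blocked in monitor". */
    String state() {
        if (completed) return "ok";
        return inMonitor ? "Blocked in monitor" : "Blocked in handler";
    }

    /** Triggers each failure mode once and returns the states observed along the way. */
    static List<String> demo() {
        List<String> seen = new ArrayList<>();
        ExecutorService busyQueue = Executors.newSingleThreadExecutor();
        ExecutorService fgQueue = Executors.newSingleThreadExecutor();
        try {
            // Case 1: a long-running message hogs the queue, so the check never starts.
            CountDownLatch release = new CountDownLatch(1);
            busyQueue.execute(() -> {
                try { release.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
            CheckerSketch looperChecker = new CheckerSketch(busyQueue);
            looperChecker.scheduleCheck();
            seen.add(looperChecker.state());
            release.countDown();
            busyQueue.shutdown();
            busyQueue.awaitTermination(5, TimeUnit.SECONDS);
            seen.add(looperChecker.state());

            // Case 2: the check starts, but the monitor cannot get the service lock.
            Object serviceLock = new Object();
            CountDownLatch entered = new CountDownLatch(1);
            CheckerSketch monitorChecker = new CheckerSketch(fgQueue);
            monitorChecker.addMonitor(() -> {
                entered.countDown();
                synchronized (serviceLock) { /* just take and release the lock */ }
            });
            synchronized (serviceLock) {  // this thread plays a service holding its lock too long
                monitorChecker.scheduleCheck();
                entered.await();
                seen.add(monitorChecker.state());
            }
            fgQueue.shutdown();
            fgQueue.awaitTermination(5, TimeUnit.SECONDS);
            seen.add(monitorChecker.state());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            busyQueue.shutdownNow();
            fgQueue.shutdownNow();
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Running main prints [Blocked in handler, ok, Blocked in monitor, ok]: each checker reports its block while the fault is present and recovers once the queue or the lock is released. The real Watchdog adds a timeout on top of this state check before it decides to kill the process.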

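3. Basic principles
3.1 How Checker objects are added

Take AMS as an example: it adds both a Monitor Checker and a Looper Checker. It also implements the Watchdog.Monitor interface and overrides the monitor method.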

public class ActivityManagerService extends IActivityManager.Stub
        implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback {
  ......
    public ActivityMa