The Android Watchdog Mechanism

kaijiehui

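Android's SystemServer is a very complex process. It hosts more than fifty services, which makes it the process most likely to run into trouble, so the threads running inside it need to be monitored. A hardware-watchdog design, in which every thread has to feed the dog at regular intervals, would waste CPU time and complicate the code. Android therefore provides the Watchdog class as a software watchdog for the threads in SystemServer. When it detects a problem, the Watchdog kills the SystemServer process.
When Zygote, the parent of SystemServer, receives the signal that SystemServer has died, it kills itself. Once Zygote's death is reported to the init process, init kills all of Zygote's child processes and restarts Zygote, which amounts to restarting the whole phone. Problems in SystemServer are usually unrelated to the kernel, so this "soft restart" resolves them most of the time. It is also faster than a full reboot and less disruptive for the user.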

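The Watchdog is initialized and started inside the SystemServer process. SystemServer's run method registers and starts the Android services, and the Watchdog is initialized and started along the way. The initialization looks like this: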

final Watchdog watchdog = Watchdog.getInstance();  
watchdog.init(context, mActivityManagerService);  

In the second half of startOtherServices, SystemServer signals that the system is ready by calling systemReady. The Watchdog is started in the callback passed to ActivityManagerService's systemReady:

Watchdog.getInstance().start();  

The code above is in frameworks/base/services/java/com/android/server/SystemServer.java.
As mentioned, SystemServer.java obtains the Watchdog through getInstance, which creates it on first use. The implementation is as follows:

public static Watchdog getInstance() {  
    if (sWatchdog == null) {  
        sWatchdog = new Watchdog();    // create the singleton instance
    }  
  
    return sWatchdog;  
}  
  
private Watchdog() {  
    super("watchdog");  
    // Initialize handler checkers for each common thread we want to check.  Note  
    // that we are not currently checking the background thread, since it can  
    // potentially hold longer running operations with no guarantees about the timeliness  
    // of operations there.  
  
    // The shared foreground thread is the main checker.  It is where we  
    // will also dispatch monitor checks and do other work.  
    mMonitorChecker = new HandlerChecker(FgThread.getHandler(),  
            "foreground thread", DEFAULT_TIMEOUT);  
    mHandlerCheckers.add(mMonitorChecker);  
    // Add checker for main thread.  We only do a quick check since there  
    // can be UI running on the thread.  
    mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),  
            "main thread", DEFAULT_TIMEOUT));  
    // Add checker for shared UI thread.  
    mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),  
            "ui thread", DEFAULT_TIMEOUT));  
    // And also check IO thread.  
    mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),  
            "i/o thread", DEFAULT_TIMEOUT));  
    // And the display thread.  
    mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),  
            "display thread", DEFAULT_TIMEOUT));  
  
    // Initialize monitor for Binder threads.  
    addMonitor(new BinderThreadMonitor());  
}  

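The Watchdog constructor adds HandlerCheckers for the foreground thread, main thread, UI thread, I/O thread and display thread to the mHandlerCheckers list. Finally, it creates a BinderThreadMonitor and registers it with mMonitorChecker through addMonitor.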
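
The singleton-plus-registration shape of getInstance and the constructor can be tried outside Android. The following is only a sketch under assumptions: MiniWatchdog and its methods are hypothetical names, checker names stand in for HandlerChecker objects, and the synchronized keyword on getInstance is an addition of this sketch rather than something shown in the excerpt.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for Watchdog: a lazily created singleton holding checker names. */
final class MiniWatchdog {
    private static MiniWatchdog sInstance;

    // The real class stores HandlerChecker objects; names are enough for this sketch.
    private final List<String> checkerNames = new ArrayList<>();

    private MiniWatchdog() {
        // Same five threads, in the same order, as the constructor excerpt.
        checkerNames.add("foreground thread");
        checkerNames.add("main thread");
        checkerNames.add("ui thread");
        checkerNames.add("i/o thread");
        checkerNames.add("display thread");
    }

    // "synchronized" is an addition of this sketch so concurrent first calls are safe.
    static synchronized MiniWatchdog getInstance() {
        if (sInstance == null) {
            sInstance = new MiniWatchdog();
        }
        return sInstance;
    }

    /** Returns a copy so callers cannot modify the registered list. */
    List<String> checkerNames() {
        return new ArrayList<>(checkerNames);
    }

    public static void main(String[] args) {
        System.out.println(MiniWatchdog.getInstance().checkerNames());
    }
}
```

Every call to getInstance returns the same object, so all services end up registering with one shared list of checkers.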

public void addMonitor(Monitor monitor) {
    synchronized (this) {
        if (isAlive()) {
            throw new RuntimeException("Monitors can't be added once the Watchdog is running");
        }
        mMonitorChecker.addMonitor(monitor);
    }
}

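The code above only creates and starts the Watchdog; it does not yet know which system services to watch. To keep the Watchdog module independent and extensible, system services register themselves with it. The Watchdog offers two kinds of monitoring: a monitor() callback that checks whether a service's critical section is deadlocked or blocked, and a message posted to a service's main thread that checks whether that thread is blocked.
Take ActivityManagerService.java as an example. To register a monitor() callback, the class first has to implement the Watchdog.Monitor interface: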

public class ActivityManagerService extends ActivityManagerNativeEx  
        implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback {  

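The constructor then registers the service with the Watchdog. Two separate checks are registered here. The first is addMonitor: in every check cycle the Watchdog uses the foreground thread's HandlerChecker to call the monitor() method the service registered, which acquires the lock on the service's critical section and releases it immediately, to detect a deadlock or block there. The second is addThread: the Watchdog periodically posts a message to the service's handler through a HandlerChecker to detect whether the service's main thread is blocked. This is why a Watchdog restart is reported with one of two messages, "Blocked in handler on ..." or "Blocked in monitor ...", one for each kind of block.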

Watchdog.getInstance().addMonitor(this);  
Watchdog.getInstance().addThread(mHandler);  

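Finally, the class implements the monitor method required by Watchdog.Monitor. While the Watchdog is running it calls this method every 30 seconds to take the lock on the critical section once. If the lock cannot be acquired within 60 seconds, the service is considered deadlocked and the system has to be restarted.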

/** In this method we try to acquire our lock to make sure that we have not deadlocked */  
public void monitor() {  
    synchronized (this) { }  
}  
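
Why an empty synchronized block is enough to detect a stuck service can be seen in plain Java. This is a minimal sketch, not the framework code: the Monitor interface, Service, probe and demo are hypothetical names, and a helper thread with a short timeout stands in for the foreground thread and the 60-second limit.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class MonitorProbeDemo {
    /** Hypothetical stand-in for Watchdog.Monitor. */
    interface Monitor {
        void monitor();
    }

    /** A service whose critical section is guarded by its own intrinsic lock. */
    static final class Service implements Monitor {
        @Override
        public void monitor() {
            synchronized (this) { } // returns only once the lock could be acquired
        }
    }

    /** Returns true if m.monitor() finishes within timeoutMillis on a helper thread. */
    static boolean probe(Monitor m, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        Thread t = new Thread(() -> {
            m.monitor();
            done.countDown();
        }, "probe");
        t.setDaemon(true);
        t.start();
        try {
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Probes once while another thread holds the lock and once after it is released. */
    static boolean[] demo() {
        Service service = new Service();
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (service) {
                held.countDown();
                try {
                    release.await();
                } catch (InterruptedException ignored) { }
            }
        }, "holder");
        holder.start();
        try {
            held.await();
            boolean whileHeld = probe(service, 200);     // lock is taken, probe times out
            release.countDown();
            holder.join();
            boolean afterRelease = probe(service, 2000); // lock is free, probe returns
            return new boolean[] { whileHeld, afterRelease };
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println("while held: " + r[0] + ", after release: " + r[1]);
    }
}
```

The probe itself does no work; the only thing it measures is how long the lock stays unavailable, which is exactly the signal the Watchdog needs.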

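As shown above, the Watchdog constructor wraps the foreground thread, the main thread and the other shared threads in HandlerChecker objects. HandlerChecker is what actually detects timeouts. There are multiple instances: every service that registers through addThread gets its own HandlerChecker.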

public void addThread(Handler thread) {  
    addThread(thread, DEFAULT_TIMEOUT);  
}  
  
public void addThread(Handler thread, long timeoutMillis) {  
    synchronized (this) {  
        if (isAlive()) {  
            throw new RuntimeException("Threads can't be added once the Watchdog is running");  
        }  
        final String name = thread.getLooper().getThread().getName();  
        mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));  
    }  
}  
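
The isAlive() guard means registration is only possible before the Watchdog thread is started. A small sketch of that rule, assuming a hypothetical RegistrationSketch class that extends Thread the way the excerpt's super("watchdog") call suggests, with names instead of HandlerChecker objects:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical miniature of the registration guard; not the real Watchdog class. */
final class RegistrationSketch extends Thread {
    private final List<String> checkerNames = new ArrayList<>();

    RegistrationSketch() {
        super("watchdog-sketch");
    }

    /** Registration is only legal before start(), mirroring the isAlive() guard. */
    synchronized void addThread(String name) {
        if (isAlive()) {
            throw new RuntimeException("Threads can't be added once the Watchdog is running");
        }
        checkerNames.add(name);
    }

    synchronized int checkerCount() {
        return checkerNames.size();
    }

    @Override
    public void run() {
        try {
            Thread.sleep(300); // stands in for the watchdog loop
        } catch (InterruptedException ignored) { }
    }

    /** Returns true if a registration after start() is rejected and leaves the list intact. */
    static boolean rejectsLateRegistration() {
        RegistrationSketch w = new RegistrationSketch();
        w.addThread("ActivityManager"); // accepted: the thread has not been started
        w.start();
        try {
            w.addThread("late");
            return false;
        } catch (RuntimeException expected) {
            return w.checkerCount() == 1;
        }
    }

    public static void main(String[] args) {
        System.out.println(rejectsLateRegistration());
    }
}
```

Freezing the list before the loop starts means the loop never has to cope with checkers appearing halfway through a cycle.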

HandlerChecker implements Runnable. Each HandlerChecker is posted to, and runs on, the thread it monitors, so the checks do not interfere with one another.

/** 
 * Used for checking status of handle threads and scheduling monitor callbacks. 
 */  
public final class HandlerChecker implements Runnable {  
    private final Handler mHandler;  
    private final String mName;  
    private final long mWaitMax;  
    private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();  
    private boolean mCompleted;  
    private Monitor mCurrentMonitor;  
    private long mStartTime;  
  
    HandlerChecker(Handler handler, String name, long waitMaxMillis) {  
        mHandler = handler;  
        mName = name;  
        mWaitMax = waitMaxMillis;  
        mCompleted = true;  
    }  

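Every service that registers through addThread gets its own HandlerChecker, so who checks the services registered through addMonitor()? The answer is the mMonitorChecker seen earlier, that is, the foreground thread's HandlerChecker. Besides detecting whether its own thread is blocked, it calls the monitor() methods registered by system services to detect deadlocks or blocks in their critical sections.
monitor() cannot be called on the Watchdog's own thread, because when a monitored service's critical section is held, monitor() may take a long time to return. The Watchdog could then no longer guarantee a 30-second check cycle, so the calls are delegated to the foreground thread.
addMonitor() adds every monitor to mMonitorChecker, the foreground thread's HandlerChecker. The mMonitors list of every other HandlerChecker is empty.
Once the Watchdog's main loop is running, it calls scheduleCheckLocked() on every HandlerChecker in turn, every 30 seconds. The foreground thread's HandlerChecker has a non-empty mMonitors list and must call each service's monitor() to check for deadlocks, so it is scheduled in every cycle.
For the other HandlerCheckers, the Watchdog first checks whether the thread's Looper is idle (isPolling() returns true). If it is, the previous message has been handled and the thread is waiting for the next one, so the message loop cannot be blocked and the checker is skipped for this round.
If the message loop is not idle, the thread is busy handling a message and might be blocked. The checker then posts itself with postAtFrontOfQueue, records the current uptime, and sets mCompleted to false to mark that a message has been posted and is waiting to be handled.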

public void scheduleCheckLocked() {  
    if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {  
        // If the target looper has recently been polling, then  
        // there is no reason to enqueue our checker on it since that  
        // is as good as it not being deadlocked.  This avoid having  
        // to do a context switch to check the thread.  Note that we  
        // only do this if mCheckReboot is false and we have no  
        // monitors, since those would need to be executed at this point.  
        mCompleted = true;  
        return;  
    }  
  
    if (!mCompleted) {  
        // we already have a check in flight, so no need  
        return;  
    }  
  
    mCompleted = false;  
    mCurrentMonitor = null;  
    mStartTime = SystemClock.uptimeMillis();  
    mHandler.postAtFrontOfQueue(this);  
}  
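
The "post a probe and see whether it gets run" idea does not depend on Handler. The sketch below is an approximation under stated assumptions: a single-threaded ExecutorService stands in for the monitored handler thread, probes are queued in FIFO order instead of at the front of the queue, and the isPolling() shortcut is left out. CheckerSketch and its methods are hypothetical names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical checker that probes a single-threaded executor instead of a Handler. */
final class CheckerSketch implements Runnable {
    private final ExecutorService target; // stand-in for the monitored handler thread
    private boolean completed = true;
    private int probesPosted;

    CheckerSketch(ExecutorService target) {
        this.target = target;
    }

    /** Posts a probe unless one is still in flight. */
    synchronized void scheduleCheck() {
        if (!completed) {
            return; // a probe is already queued; do not post another
        }
        completed = false;
        probesPosted++;
        target.execute(this); // plain FIFO here; the original posts at the front of the queue
    }

    @Override
    public synchronized void run() {
        completed = true; // executed on the target thread once it gets to the probe
    }

    synchronized boolean isCompleted() {
        return completed;
    }

    synchronized int probesPosted() {
        return probesPosted;
    }

    /** Returns {completed while blocked, completed after unblocking, probes posted}. */
    static int[] demo() {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        CountDownLatch unblock = new CountDownLatch(1);
        worker.execute(() -> {
            try {
                unblock.await(); // simulates a message that blocks the thread
            } catch (InterruptedException ignored) { }
        });
        CheckerSketch checker = new CheckerSketch(worker);
        checker.scheduleCheck();
        checker.scheduleCheck(); // ignored: the first probe is still pending
        int whileBlocked = checker.isCompleted() ? 1 : 0;
        unblock.countDown();
        worker.shutdown();
        try {
            worker.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        int afterUnblock = checker.isCompleted() ? 1 : 0;
        return new int[] { whileBlocked, afterUnblock, checker.probesPosted() };
    }

    public static void main(String[] args) {
        int[] r = demo();
        System.out.println(r[0] + " " + r[1] + " " + r[2]);
    }
}
```

The early return for an in-flight probe matters: without it a blocked thread would accumulate one extra probe per cycle, and the recorded start time would keep being reset.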

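If the thread's message queue is not blocked, postAtFrontOfQueue quickly leads to the HandlerChecker's run method being called. For the foreground thread's HandlerChecker, run calls the monitor method of every monitored service, locking its critical section and releasing it immediately, to check for a deadlock or block. For the other threads, it only has to set mCompleted to true to show that the message has been handled.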

    @Override  
    public void run() {  
        final int size = mMonitors.size();  
        for (int i = 0 ; i < size ; i++) {  
            synchronized (Watchdog.this) {  
                mCurrentMonitor = mMonitors.get(i);  
            }  
            mCurrentMonitor.monitor();  
        }  
  
        synchronized (Watchdog.this) {  
            mCompleted = true;  
            mCurrentMonitor = null;  
        }  
    }  
}  
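
Recording the monitor currently being called is what later lets the Watchdog name the culprit. A self-contained sketch of that bookkeeping follows; MonitorLoopSketch, Healthy, Stuck and describe are hypothetical names, and the message format is modelled on the two kinds of report mentioned in this article rather than copied from the framework.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

final class MonitorLoopSketch implements Runnable {
    /** Hypothetical stand-in for Watchdog.Monitor. */
    interface Monitor {
        void monitor();
    }

    static final class Healthy implements Monitor {
        @Override public void monitor() { }
    }

    /** Blocks inside monitor() until released, like a service stuck on its lock. */
    static final class Stuck implements Monitor {
        final CountDownLatch entered = new CountDownLatch(1);
        final CountDownLatch release = new CountDownLatch(1);
        @Override public void monitor() {
            entered.countDown();
            try {
                release.await();
            } catch (InterruptedException ignored) { }
        }
    }

    private final String name;
    private final List<Monitor> monitors = new ArrayList<>();
    private final Object lock = new Object();
    private boolean completed = true;
    private Monitor current;

    MonitorLoopSketch(String name, List<Monitor> monitors) {
        this.name = name;
        this.monitors.addAll(monitors);
    }

    @Override
    public void run() {
        // The original clears the flag when the check is scheduled; done here for brevity.
        synchronized (lock) { completed = false; }
        for (Monitor m : monitors) {
            synchronized (lock) { current = m; } // remember which monitor is being called
            m.monitor();                         // may block if the service is stuck
        }
        synchronized (lock) {
            completed = true;
            current = null;
        }
    }

    String describe() {
        synchronized (lock) {
            if (completed) {
                return "completed";
            }
            return current == null
                    ? "Blocked in handler on " + name
                    : "Blocked in monitor " + current.getClass().getSimpleName() + " on " + name;
        }
    }

    /** Returns the description while one monitor is stuck and after it is released. */
    static String[] demo() {
        Stuck stuck = new Stuck();
        List<Monitor> list = new ArrayList<>();
        list.add(new Healthy());
        list.add(stuck);
        MonitorLoopSketch checker = new MonitorLoopSketch("foreground thread", list);
        Thread t = new Thread(checker, "fg-sketch");
        t.start();
        try {
            stuck.entered.await();
            String during = checker.describe();
            stuck.release.countDown();
            t.join();
            return new String[] { during, checker.describe() };
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] r = demo();
        System.out.println(r[0]);
        System.out.println(r[1]);
    }
}
```

The current monitor is updated under the lock but monitor() is called outside it, so a stuck monitor never prevents another thread from reading the description.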

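If a service's message loop is blocked, mCompleted stays false. In every check cycle the Watchdog calls getCompletionStateLocked on each HandlerChecker in turn to check how long it has been waiting. If any thread has not responded for 30 s, the Watchdog dumps stack traces early in preparation for a restart; if it has not responded for 60 s, the restart procedure begins.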

public int getCompletionStateLocked() {  
    if (mCompleted) {  
        return COMPLETED;  
    } else {  
        long latency = SystemClock.uptimeMillis() - mStartTime;  
        if (latency < mWaitMax/2) {  
            return WAITING;  
        } else if (latency < mWaitMax) {  
            return WAITED_HALF;  
        }  
    }  
    return OVERDUE;  
}  
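
The boundaries of the four states are easiest to check with concrete numbers. The function below restates the excerpt's logic as a pure function; the class name, the parameter names and the numeric values of the constants are assumptions of this sketch (only their ordering matters), and 60 s is the limit described in the text.

```java
final class CompletionStateSketch {
    // Numeric values are assumptions of this sketch; only their ordering matters.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** Classifies one checker from its completion flag and how long its probe has been pending. */
    static int classify(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / 2) {
            return WAITING;
        }
        if (latencyMillis < waitMaxMillis) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    public static void main(String[] args) {
        long waitMax = 60_000; // the 60 s limit described in the text
        for (long latency : new long[] { 0, 29_999, 30_000, 59_999, 60_000 }) {
            System.out.println(latency + " ms -> " + classify(false, latency, waitMax));
        }
    }
}
```

Both comparisons are strict, so a probe pending for exactly 30 s already counts as WAITED_HALF and one pending for exactly 60 s counts as OVERDUE.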

The Watchdog main loop
When SystemServer calls the Watchdog's start method, the Watchdog begins running a while loop on its own thread, checking once every 30 s:

@Override  
public void run() {  
    boolean waitedHalf = false;  
    while (true) {  
        final ArrayList<HandlerChecker> blockedCheckers;  
        final String subject;  
        final boolean allowRestart;  
        int debuggerWasConnected = 0;  
        synchronized (this) {  
            long timeout = CHECK_INTERVAL;  
            // Make sure we (re)spin the checkers that have become idle within  
            // this wait-and-check interval  
            for (int i=0; i<mHandlerCheckers.size(); i++) {   // iterate over all HandlerCheckers: foreground, UI, main and the other key system threads
                HandlerChecker hc = mHandlerCheckers.get(i);  
                hc.scheduleCheckLocked();  
            }  

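Each check cycle first resets the timeout to CHECK_INTERVAL and then checks, in turn, whether the foreground thread, main thread, UI thread, I/O thread and display thread registered in the Watchdog constructor, and the service threads registered through addThread, are blocked.
A thread is checked as follows: if its Looper is not idle, the HandlerChecker posts itself as a message with postAtFrontOfQueue, and the Watchdog later checks whether that message has been handled in time.
After posting the messages, the Watchdog sleeps for 30 s. Note that elapsed time is measured with uptimeMillis(), which does not count time the device spends asleep. While the device is asleep the system services are asleep too and cannot respond to the Watchdog's messages. If sleep time were counted, the Watchdog would believe a long time had passed as soon as the device woke up, and would kill the process by mistake.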

// NOTE: We use uptimeMillis() here because we do not want to increment the time we  
// wait while asleep. If the device is asleep then the thing that we are waiting  
// to timeout on is asleep as well and won't have a chance to run, causing a false  
// positive on when to kill things.  
long start = SystemClock.uptimeMillis();   // uptimeMillis() excludes time the device spends asleep
while (timeout > 0) {  
    if (Debug.isDebuggerConnected()) {  
        debuggerWasConnected = 2;  
    }  
    try {  
        wait(timeout);  
    } catch (InterruptedException e) {  
        Log.wtf(TAG, e);  
    }  
    if (Debug.isDebuggerConnected()) {  
        debuggerWasConnected = 2;  
    }  // CHECK_INTERVAL defaults to 30 s and this is the first wait; the Watchdog waits at most 1 min before declaring a deadlock
    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);  
}  
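
The inner while loop exists because wait(timeout) can return early, for example after a notify or an interrupt, so the remaining time is recomputed until the full interval has passed. A sketch of the same pattern in plain Java, assuming System.nanoTime() as a stand-in for SystemClock.uptimeMillis() and using hypothetical names:

```java
final class IntervalWaitSketch {
    /** Waits on lock until intervalMillis have elapsed, tolerating early wakeups. Returns elapsed ms. */
    static long waitFullInterval(Object lock, long intervalMillis) {
        long start = System.nanoTime(); // stand-in for SystemClock.uptimeMillis()
        long remaining = intervalMillis;
        synchronized (lock) {
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    // The excerpt logs the interrupt and keeps waiting; this sketch just keeps waiting.
                }
                // Recompute from the start time so early wakeups do not shorten the interval.
                remaining = intervalMillis - (System.nanoTime() - start) / 1_000_000;
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    /** Wakes the waiter after about 50 ms and reports how long it actually waited. */
    static long demo() {
        Object lock = new Object();
        Thread notifier = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException ignored) { }
            synchronized (lock) {
                lock.notifyAll(); // early wakeup that the loop has to absorb
            }
        }, "notifier");
        notifier.start();
        return waitFullInterval(lock, 200);
    }

    public static void main(String[] args) {
        System.out.println("waited " + demo() + " ms");
    }
}
```

Computing the remaining time from a fixed start point, rather than subtracting each individual wait, keeps rounding and wakeup latency from accumulating.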

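After the 30-second wait, the Watchdog checks whether the messages posted earlier have been handled. evaluateCheckerCompletionLocked iterates over all HandlerCheckers and returns the largest waitState. There are four possible values: COMPLETED means the message has been handled and the thread is not blocked; WAITING means the message has been pending for less than 30 seconds and the Watchdog keeps waiting; WAITED_HALF means it has been pending for 30 to 59 seconds, so the thread may be blocked and the current stack traces are dumped through ActivityManagerService.dumpStackTraces for use if the timeout does occur; OVERDUE means it has been pending for 60 s or more, so the Watchdog moves on to the next stage, dumps the stack information and restarts the system.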

final int waitState = evaluateCheckerCompletionLocked();  
if (waitState == COMPLETED) {  
    // The monitors have returned; reset  
    waitedHalf = false;   // all checkers are healthy; reset
    continue;  
} else if (waitState == WAITING) {  
    // still waiting but within their configured intervals; back off and recheck  
    continue;  
} else if (waitState == WAITED_HALF) {  
    if (!waitedHalf) {  
        // We've waited half the deadlock-detection interval.  Pull a stack  
        // trace and wait another half.  
        ArrayList<Integer> pids = new ArrayList<Integer>();  
        pids.add(Process.myPid());  
        ActivityManagerService.dumpStackTraces(true, pids, null, null,  
                NATIVE_STACKS_OF_INTEREST);  
        waitedHalf = true;  
    }  
    continue;  
}  
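
Taking the largest state across all checkers and then branching on it can be written as two small pure functions. This is a sketch only: evaluateCheckerCompletionLocked itself is not shown in this article, so worstState is an assumed reconstruction from its description, and the constant values, the Action enum and nextAction are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class WaitStateSketch {
    // Numeric values are assumptions of this sketch; only their ordering matters.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    enum Action { CONTINUE, DUMP_STACKS_THEN_CONTINUE, HANDLE_TIMEOUT }

    /** The overall state is the worst (largest) state of any single checker. */
    static int worstState(List<Integer> states) {
        int worst = COMPLETED;
        for (int s : states) {
            worst = Math.max(worst, s);
        }
        return worst;
    }

    /** Mirrors the branch in the loop: stacks are dumped only the first time half the limit is hit. */
    static Action nextAction(int waitState, boolean alreadyWaitedHalf) {
        if (waitState == COMPLETED || waitState == WAITING) {
            return Action.CONTINUE;
        }
        if (waitState == WAITED_HALF) {
            return alreadyWaitedHalf ? Action.CONTINUE : Action.DUMP_STACKS_THEN_CONTINUE;
        }
        return Action.HANDLE_TIMEOUT;
    }

    public static void main(String[] args) {
        int state = worstState(Arrays.asList(COMPLETED, WAITED_HALF, WAITING));
        System.out.println(state + " -> " + nextAction(state, false));
    }
}
```

Because the states are ordered by severity, a single blocked thread is enough to push the whole cycle into the corresponding branch, however many other checkers are healthy.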

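At this point a Watchdog timeout has occurred, but evaluateCheckerCompletionLocked does not care which service is blocked; it only returns the largest waitState across all checkers. getBlockedCheckersLocked is now called to determine exactly which checkers are blocked, and describeCheckersLocked builds the description of why. This is the description that shows up in dropbox. The Watchdog then dumps the stack traces of the system process, followed by the kernel stacks.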

            // something is overdue!  
            blockedCheckers = getBlockedCheckersLocked();   // watchdog timed out: find which checkers are blocked
            subject = describeCheckersLocked(blockedCheckers);  // build the description of the blocked state
            allowRestart = mAllowRestart;  // whether a restart is permitted
        }  
  
        // If we got here, that means that the system is most likely hung.  
        // First collect stack traces from all threads of the system process.  
        // Then kill this process so that the system will restart.  
        EventLog.writeEvent(EventLogTags.WATCHDOG, subject);  
  
        ArrayList<Integer> pids = new ArrayList<Integer>();  
        pids.add(Process.myPid());  
        if (mPhonePid > 0) pids.add(mPhonePid);  
        // Pass !waitedHalf so that just in case we somehow wind up here without having  
        // dumped the halfway stacks, we properly re-initialize the trace file.  
        final File stack = ActivityManagerService.dumpStackTraces(  
                !waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);  
  
        // Give some extra time to make sure the stack traces get written.  
        // The system's been hanging for a minute, another second or two won't hurt much.  
        SystemClock.sleep(2000);  
  
        // Pull our own kernel thread stacks as well if we're configured for that  
        if (RECORD_KERNEL_THREADS) {  
            dumpKernelStackTraces();  
        }  
  
        // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log  
        doSysRq('w');  
        doSysRq('l');  
  
        // Try to add the error to the dropbox, but assuming that the ActivityManager  
        // itself may be deadlocked.  (which has happened, causing this statement to  
        // deadlock and the watchdog as a whole to be ineffective)  
        Thread dropboxThread = new Thread("watchdogWriteToDropbox") {  
                public void run() {  
                    mActivity.addErrorToDropBox(  
                            "watchdog", null, "system_server", null, null,  
                            subject, null, stack, null);  
                }  
            };  
        dropboxThread.start();  
        try {  
            dropboxThread.join(2000);  // wait up to 2 seconds for it to return.  
        } catch (InterruptedException ignored) {}  
  
        IActivityController controller;  
        synchronized (this) {  
            controller = mController;  
        }  
        if (controller != null) {  
            Slog.i(TAG, "Reporting stuck state to activity controller");  
            try {  
                Binder.setDumpDisabled("Service dumps disabled due to hung system process.");  
                // 1 = keep waiting, -1 = kill system  
                int res = controller.systemNotResponding(subject);  
                if (res >= 0) {  
                    Slog.i(TAG, "Activity controller requested to coninue to wait");  
                    waitedHalf = false;  
                    continue;  
                }  
            } catch (RemoteException e) {  
            }  
        }  
  
        // Only kill the process if the debugger is not attached.  
        if (Debug.isDebuggerConnected()) {  
            debuggerWasConnected = 2;  
        }  
        if (debuggerWasConnected >= 2) {  
            Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");  
        } else if (debuggerWasConnected > 0) {  
            Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");  
        } else if (!allowRestart) {  
            Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");  
        } else {  
            Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);  
            for (int i=0; i<blockedCheckers.size(); i++) {  
                Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");  
                StackTraceElement[] stackTrace  
                        = blockedCheckers.get(i).getThread().getStackTrace();  
                for (StackTraceElement element: stackTrace) {  
                    Slog.w(TAG, "    at " + element);  
                }  
            }  
            Slog.w(TAG, "*** GOODBYE!");  
            Process.killProcess(Process.myPid());  
            System.exit(10);  
        }  
  
        waitedHalf = false;  
    }  
}  
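
The dropbox step runs on a separate thread and is joined with a timeout because the call might itself hang on the very lock that is deadlocked. The same defensive pattern in isolation, as a sketch with hypothetical names (BoundedJoinSketch, runWithTimeout); making the helper a daemon thread is a choice of this sketch, not something taken from the excerpt:

```java
final class BoundedJoinSketch {
    /** Runs task on a helper thread and waits at most timeoutMillis. Returns true if it finished. */
    static boolean runWithTimeout(Runnable task, long timeoutMillis) {
        Thread helper = new Thread(task, "bounded-join-sketch");
        helper.setDaemon(true); // a stuck task must not keep the JVM alive
        helper.start();
        try {
            helper.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !helper.isAlive();
    }

    public static void main(String[] args) {
        boolean quick = runWithTimeout(() -> { }, 2000);
        boolean stuck = runWithTimeout(() -> {
            try {
                Thread.sleep(5000); // simulates a call that hangs
            } catch (InterruptedException ignored) { }
        }, 100);
        System.out.println("quick finished: " + quick + ", stuck finished: " + stuck);
    }
}
```

The caller moves on after the timeout whether or not the helper finished, which is what keeps the Watchdog itself from being dragged into the deadlock it is reporting.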

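Finally, the Watchdog writes the error to dropbox and asks the activity controller, if one is set, whether it wants to handle this timeout. If the controller asks to keep waiting, the timeout is ignored and the Watchdog loop starts over. Otherwise, unless a debugger is attached or restarts are disallowed, the Watchdog kills SystemServer, which restarts the system.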
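
The order of these final checks can be summarized as one decision function. This is a simplified sketch with hypothetical names: KillDecisionSketch, Action and decide are not part of the framework, a null controllerResult models "no controller set", and the two debugger log messages of the original are merged into a single outcome.

```java
final class KillDecisionSketch {
    enum Action { KEEP_WAITING, SKIP_DEBUGGER, SKIP_RESTART_NOT_ALLOWED, KILL }

    /**
     * controllerResult is null when no activity controller is set; otherwise a value of 0 or
     * more asks the watchdog to keep waiting and a negative value lets it proceed.
     */
    static Action decide(Integer controllerResult, int debuggerWasConnected, boolean allowRestart) {
        if (controllerResult != null && controllerResult >= 0) {
            return Action.KEEP_WAITING;          // the controller took responsibility
        }
        if (debuggerWasConnected > 0) {
            return Action.SKIP_DEBUGGER;         // never kill a process that is being debugged
        }
        if (!allowRestart) {
            return Action.SKIP_RESTART_NOT_ALLOWED;
        }
        return Action.KILL;
    }

    public static void main(String[] args) {
        System.out.println(decide(null, 0, true));
        System.out.println(decide(1, 0, true));
        System.out.println(decide(-1, 2, true));
        System.out.println(decide(-1, 0, false));
    }
}
```

Killing the process is the last resort: every earlier branch is a reason to leave a hung system process alone.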
