Android: An In-Depth Look at How Watchdog Works

Latest recommended article published on 2024-07-28 19:12:58

Bill_xiao

Category column: DEBGU

Article link: https://blog.csdn.net/Bill_xiao/article/details/115525772

android Watchdog

Reading source code with specific questions in mind is the most effective approach.

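I. Overview

1.1 Guiding questions

This article studies Watchdog around the following questions:

1. How does Watchdog work?

2. What does the system do after a Watchdog timeout occurs, and which key log messages are printed?

1.2 What Watchdog does

In Android, a hardware watchdog periodically checks whether critical hardware is working properly. Similarly, the framework layer has a software Watchdog that periodically checks whether critical system services have deadlocked. Its main job is to detect whether core system services and important threads are in a blocked state.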

II. Source code: how Watchdog works

2.1 Starting Watchdog

SystemServer.java

    private void startBootstrapServices(@NonNull TimingsTraceAndSlog t) {
        t.traceBegin("startBootstrapServices");

        // Start the watchdog as early as possible so we can crash the system server
        // if we deadlock during early boot
        t.traceBegin("StartWatchdog");
        final Watchdog watchdog = Watchdog.getInstance();
        watchdog.start();
        t.traceEnd();
        // ... (rest of the method omitted)
    }

2.2 getInstance

 public static Watchdog getInstance() {
        if (sWatchdog == null) {
            sWatchdog = new Watchdog();
        }

        return sWatchdog;
    }

2.3 Watchdog

 private Watchdog() {
        super("watchdog");
        // Initialize handler checkers for each common thread we want to check.  Note
        // that we are not currently checking the background thread, since it can
        // potentially hold longer running operations with no guarantees about the timeliness
        // of operations there.

        // The shared foreground thread is the main checker.  It is where we
        // will also dispatch monitor checks and do other work.
        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);
        // Add checker for main thread.  We only do a quick check since there
        // can be UI running on the thread.
        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                "main thread", DEFAULT_TIMEOUT));
        // Add checker for shared UI thread.
        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                "ui thread", DEFAULT_TIMEOUT));
        // And also check IO thread.
        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                "i/o thread", DEFAULT_TIMEOUT));
        // And the display thread.
        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                "display thread", DEFAULT_TIMEOUT));
        // And the animation thread.
        mHandlerCheckers.add(new HandlerChecker(AnimationThread.getHandler(),
                "animation thread", DEFAULT_TIMEOUT));
        // And the surface animation thread.
        mHandlerCheckers.add(new HandlerChecker(SurfaceAnimationThread.getHandler(),
                "surface animation thread", DEFAULT_TIMEOUT));

        // Initialize monitor for Binder threads.
        addMonitor(new BinderThreadMonitor()); // registered on mMonitorChecker, so it runs on the foreground thread

        mOpenFdMonitor = OpenFdMonitor.create();

        mInterestingJavaPids.add(Process.myPid());

        // See the notes on DEFAULT_TIMEOUT.
        assert DB ||
                DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;
    }

The constructor above creates one HandlerChecker for each of FgThread, the main looper, UiThread, IoThread, DisplayThread, AnimationThread and SurfaceAnimationThread. HandlerChecker implements Runnable, and every instance is added to mHandlerCheckers.

2.3.1 HandlerChecker

   public final class HandlerChecker implements Runnable {
        private final Handler mHandler;
        private final String mName; // name of the checked thread
        private final long mWaitMax; // maximum wait time, normally 60s
        private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
        private final ArrayList<Monitor> mMonitorQueue = new ArrayList<Monitor>();
        private boolean mCompleted; // initialized to true
        private Monitor mCurrentMonitor;
        private long mStartTime; // time at which the current check started
        private int mPauseCount;

        HandlerChecker(Handler handler, String name, long waitMaxMillis) {
            mHandler = handler;
            mName = name;
            mWaitMax = waitMaxMillis;
            mCompleted = true;
        }

        void addMonitorLocked(Monitor monitor) {
            // We don't want to update mMonitors when the Handler is in the middle of checking
            // all monitors. We will update mMonitors on the next schedule if it is safe
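            mMonitorQueue.add(monitor); // queue the monitor; it is moved into mMonitors later
        }
        // ... the remaining methods are covered in 2.4.1 to 2.4.3
    }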

private static final class BinderThreadMonitor implements Watchdog.Monitor {
    public void monitor() {
        Binder.blockUntilThreadAvailable();
    }
}

2.3.2 IPCThreadState.cpp

void IPCThreadState::blockUntilThreadAvailable()
{
    pthread_mutex_lock(&mProcess->mThreadCountLock);
    while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {
        // wait until the number of executing binder threads drops below the process's binder thread limit (16)
        pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);
    }
    pthread_mutex_unlock(&mProcess->mThreadCountLock);
}

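Watchdog's working principle is covered later, but a short preview helps here. HandlerChecker.run() calls mCurrentMonitor.monitor(). monitor() is the check method that each service (AMS, PMS and so on) implements, and for Binder it is a call to blockUntilThreadAvailable. If that call stalls, for example because binder threads are stuck, or because the number of executing binder threads has reached the limit (16) and the caller must wait for another binder thread to be released, the check may time out. Watchdog then concludes that the system is hung.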
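The blocking behaviour described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the AOSP implementation: ThreadSlotPool and its methods are hypothetical names, a ReentrantLock with a Condition stands in for the pthread mutex and condition variable, and a timeout is added so the sketch can report a stall instead of blocking forever.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the native binder thread accounting; not an Android API.
final class ThreadSlotPool {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition slotFreed = lock.newCondition();
    private final int maxThreads;
    private int executing;

    ThreadSlotPool(int maxThreads) {
        this.maxThreads = maxThreads;
    }

    // Marks one more thread as busy.
    void begin() {
        lock.lock();
        try {
            executing++;
        } finally {
            lock.unlock();
        }
    }

    // Marks one thread as idle again and wakes up any waiter.
    void end() {
        lock.lock();
        try {
            executing--;
            slotFreed.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Waits until fewer than maxThreads threads are busy.
    // Returns false if the timeout elapses first or the caller is interrupted.
    boolean awaitAvailable(long timeoutMillis) {
        long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        lock.lock();
        try {
            while (executing >= maxThreads) {
                if (nanos <= 0L) {
                    return false;
                }
                nanos = slotFreed.awaitNanos(nanos);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            lock.unlock();
        }
    }
}
```

With a pool of two slots, awaitAvailable succeeds while one slot is busy and times out once both are busy, which is the situation in which the Binder monitor would keep the Watchdog check from completing.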

2.4 run

 @Override
    public void run() {
        boolean waitedHalf = false;
        while (true) {
            final List<HandlerChecker> blockedCheckers;
            final String subject;
            final boolean allowRestart;
            int debuggerWasConnected = 0;
            synchronized (this) {
                long timeout = CHECK_INTERVAL;
                // Make sure we (re)spin the checkers that have become idle within
                // this wait-and-check interval
                // Iterate over every registered HandlerChecker and call scheduleCheckLocked(),
                // which is the core method of Watchdog; see 2.4.1
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerChecker hc = mHandlerCheckers.get(i);
                    hc.scheduleCheckLocked();
                }

                if (debuggerWasConnected > 0) {
                    debuggerWasConnected--;
                }

                // NOTE: We use uptimeMillis() here because we do not want to increment the time we
                // wait while asleep. If the device is asleep then the thing that we are waiting
                // to timeout on is asleep as well and won't have a chance to run, causing a false
                // positive on when to kill things.
                long start = SystemClock.uptimeMillis();
                while (timeout > 0) { // loop so that the total wait is at least CHECK_INTERVAL (30s), even if wait() returns early
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    try {
                        wait(timeout);
                        // Note: mHandlerCheckers and mMonitorChecker may have changed after waiting
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                }

                boolean fdLimitTriggered = false;
                if (mOpenFdMonitor != null) { // watch for file descriptor (fd) leaks
                    // the check is based on the file /proc/self/fd/1012
                    fdLimitTriggered = mOpenFdMonitor.monitor();
                }

                if (!fdLimitTriggered) { // normally false; true only when an fd leak has occurred
                    // The state is derived from the difference between the current time and
                    // mStartTime, compared with mWaitMax/2 (30s) and mWaitMax (60s):
                    //   COMPLETED = 0: the check has completed;
                    //   WAITING = 1: waited less than half of DEFAULT_TIMEOUT, i.e. less than 30s;
                    //   WAITED_HALF = 2: waited between 30s and 60s;
                    //   OVERDUE = 3: waited 60s or longer.
                    final int waitState = evaluateCheckerCompletionLocked(); // see 2.4.4
                    if (waitState == COMPLETED) {
                        // The monitors have returned; reset
                        waitedHalf = false;
                        continue;
                    } else if (waitState == WAITING) {
                        // still waiting but within their configured intervals; back off and recheck
                        continue;
                    } else if (waitState == WAITED_HALF) {
                        if (!waitedHalf) {
                            Slog.i(TAG, "WAITED_HALF");
                            // We've waited half the deadlock-detection interval.  Pull a stack
                            // trace and wait another half.
                            // the first time the wait exceeds 30s, dump the stack traces
                            ArrayList<Integer> pids = new ArrayList<>(mInterestingJavaPids);
                            ActivityManagerService.dumpStackTraces(pids, null, null,
                                    getInterestingNativePids(), null);
                            waitedHalf = true;
                        }
                        continue;
                    }

                    // something is overdue!
                    // collect the blocked checkers, i.e. those that have been waiting for more than one minute
                    blockedCheckers = getBlockedCheckersLocked();
                    // build the subject from describeBlockedStateLocked() of every blocked checker
                    subject = describeCheckersLocked(blockedCheckers);
                } else {
                    blockedCheckers = Collections.emptyList();
                    subject = "Open FD high water mark reached";
                }
                allowRestart = mAllowRestart; // if false, the system process is not killed
            }

            // If we got here, that means that the system is most likely hung.
            // First collect stack traces from all threads of the system process.
            // Then kill this process so that the system will restart.
            EventLog.writeEvent(EventLogTags.WATCHDOG, subject); // so a Watchdog timeout always leaves an entry in the event log

            ArrayList<Integer> pids = new ArrayList<>(mInterestingJavaPids);

            long anrTime = SystemClock.uptimeMillis();
            StringBuilder report = new StringBuilder();
            report.append(MemoryPressureUtil.currentPsiState()); // adds the contents of /proc/pressure/memory
            ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(false);
            StringWriter tracesFileException = new StringWriter();
            // second dump: append the stack traces of system_server and the interesting native processes
            final File stack = ActivityManagerService.dumpStackTraces(
                    pids, processCpuTracker, new SparseArray<>(), getInterestingNativePids(),
                    tracesFileException);

            // Give some extra time to make sure the stack traces get written.
            // The system's been hanging for a minute, another second or two won't hurt much.
            SystemClock.sleep(5000); // the system has already been blocked for a minute, so wait 5s more to make sure the stack traces are written

            processCpuTracker.update();
            report.append(processCpuTracker.printCurrentState(anrTime));
            report.append(tracesFileException.getBuffer());

            // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
            // writing to /proc/sysrq-trigger makes the kernel print this information
            doSysRq('w');
            doSysRq('l');

            // Try to add the error to the dropbox, but assuming that the ActivityManager
            // itself may be deadlocked.  (which has happened, causing this statement to
            // deadlock and the watchdog as a whole to be ineffective)
            Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                    public void run() {
                        // If a watched thread hangs before init() is called, we don't have a
                        // valid mActivity. So we can't log the error to dropbox.
                        if (mActivity != null) {
                            mActivity.addErrorToDropBox(
                                    "watchdog", null, "system_server", null, null, null,
                                    subject, report.toString(), stack, null);
                        }
                        FrameworkStatsLog.write(FrameworkStatsLog.SYSTEM_SERVER_WATCHDOG_OCCURRED,
                                subject);
                    }
                };
            dropboxThread.start();
            try {
                dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
            } catch (InterruptedException ignored) {}

            IActivityController controller;
            synchronized (this) {
                controller = mController;
            }
            if (controller != null) {
                Slog.i(TAG, "Reporting stuck state to activity controller");
                try {
                    // report the blocked state to the activity controller
                    Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                    // 1 = keep waiting, -1 = kill system
                    int res = controller.systemNotResponding(subject);
                    // any non-negative result is treated as "keep waiting"
                    if (res >= 0) {
                        Slog.i(TAG, "Activity controller requested to coninue to wait");
                        waitedHalf = false;
                        continue;
                    }
                } catch (RemoteException e) {
                }
            }

            // Only kill the process if the debugger is not attached.
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            if (debuggerWasConnected >= 2) {
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
            } else if (debuggerWasConnected > 0) {
                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
            } else if (!allowRestart) {
                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
            } else {
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
                Slog.w(TAG, "*** GOODBYE!");
                Process.killProcess(Process.myPid());
                System.exit(10);
            }

            waitedHalf = false;
        }
    }
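The inner while (timeout > 0) loop in run() exists because wait() may return before the full interval has passed. The following is a minimal sketch of that pattern in plain Java, assuming System.nanoTime() as a stand-in for SystemClock.uptimeMillis(), which is only available on Android. IntervalWaiter and waitFullInterval are hypothetical names used for illustration.

```java
// Hypothetical helper that mirrors the "recompute the remaining timeout" loop.
final class IntervalWaiter {
    // Blocks on 'lock' until at least intervalMillis have elapsed.
    // Returns the elapsed time in milliseconds.
    static long waitFullInterval(Object lock, long intervalMillis) {
        final long start = System.nanoTime();
        long elapsed = 0L;
        synchronized (lock) {
            long timeout = intervalMillis;
            while (timeout > 0) {
                try {
                    // May return early because of notify() or a spurious wakeup.
                    lock.wait(timeout);
                } catch (InterruptedException e) {
                    // Keep waiting; the original loop only logs the exception.
                }
                elapsed = (System.nanoTime() - start) / 1_000_000L;
                // Remaining time; the loop ends once the full interval has passed.
                timeout = intervalMillis - elapsed;
            }
        }
        return elapsed;
    }
}
```

Calling IntervalWaiter.waitFullInterval(new Object(), 50) returns only after at least 50 ms, regardless of how many times wait() wakes up early.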

2.4.1 scheduleCheckLocked

	public void scheduleCheckLocked() {
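            // When no check is in flight (mCompleted is true), move the monitors queued in
            // mMonitorQueue into mMonitors and clear the queue. Each addMonitorLocked() call
            // queues a single monitor.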
            if (mCompleted) {
                // Safe to update monitors in queue, Handler is not in the middle of work
                mMonitors.addAll(mMonitorQueue);
                mMonitorQueue.clear();
            }
			
            // Return early if there are no monitors and the target looper is polling (idle),
            // or if this checker is paused
            if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
                    || (mPauseCount > 0)) {
                // Don't schedule until after resume OR
                // If the target looper has recently been polling, then
                // there is no reason to enqueue our checker on it since that
                // is as good as it not being deadlocked.  This avoid having
                // to do a context switch to check the thread. Note that we
                // only do this if we have no monitors since those would need to
                // be executed at this point.
                mCompleted = true;
                return;
            }
            if (!mCompleted) { // a check is already in flight, so do not reset the start time or post again
                // we already have a check in flight, so no need
                return;
            }

            mCompleted = false;
            mCurrentMonitor = null;
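            mStartTime = SystemClock.uptimeMillis(); // record when this check starts
            // Post this checker itself to the front of the handler's message queue. mHandler is one of
            // FgThread.getHandler(), UiThread.getHandler(), IoThread.getHandler() and so on, so an SWT
            // (software watchdog timeout) can occur either because a task on that handler runs too long
            // or because the thread itself is stuck. describeBlockedStateLocked() uses mCurrentMonitor to
            // tell the two cases apart (see HandlerChecker.run() in 2.4.3). If the handler has reached
            // run(), mCurrentMonitor is non-null, so a timeout means a monitor() call is stuck inside a
            // monitored service (AMS, PMS and so on). If mCurrentMonitor == null, the timeout happened
            // before the handler thread ever got to execute run().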
            mHandler.postAtFrontOfQueue(this); 
        }
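The bookkeeping in scheduleCheckLocked() can be modelled without Android classes. The sketch below is a simplified, single-threaded model under stated assumptions: MiniChecker is a hypothetical name, a Deque stands in for the Looper's message queue, the current time is passed in as a parameter, and the polling and pause checks are left out.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Simplified model of HandlerChecker; not the AOSP class.
final class MiniChecker implements Runnable {
    private final Deque<Runnable> queue; // stand-in for the handler's message queue
    private final List<Runnable> monitors = new ArrayList<>();
    private final List<Runnable> monitorQueue = new ArrayList<>();
    private boolean completed = true;
    private long startTime;

    MiniChecker(Deque<Runnable> queue) {
        this.queue = queue;
    }

    // New monitors are parked in monitorQueue until it is safe to merge them.
    void addMonitor(Runnable monitor) {
        monitorQueue.add(monitor);
    }

    void scheduleCheck(long now) {
        if (completed) {
            // Safe point: no check is in flight, so the monitor list may change.
            monitors.addAll(monitorQueue);
            monitorQueue.clear();
        }
        if (!completed) {
            // A check is already in flight; keep its original start time.
            return;
        }
        completed = false;
        startTime = now;
        queue.addFirst(this); // plays the role of postAtFrontOfQueue(this)
    }

    @Override
    public void run() {
        for (Runnable monitor : monitors) {
            monitor.run();
        }
        completed = true;
    }

    boolean isCompleted() { return completed; }

    long startTime() { return startTime; }

    int monitorCount() { return monitors.size(); }
}
```

In this model a second scheduleCheck() call made before the queued checker has run changes nothing, so the start time keeps growing older. That is how a stuck thread eventually crosses the 30s and 60s thresholds.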

2.4.2 describeBlockedStateLocked

 String describeBlockedStateLocked() {
            if (mCurrentMonitor == null) {
                return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
            } else {
                return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
                        + " on " + mName + " (" + getThread().getName() + ")";
            }
        }
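To make the two message shapes concrete, here is a small stand-alone sketch of the same string building. BlockedStateText is a hypothetical name, and the argument values in the usage note are invented for illustration; they are not taken from a real log.

```java
// Hypothetical helper reproducing the two message formats shown above.
final class BlockedStateText {
    // monitorClassName is null when the checker never reached a monitor() call.
    static String describe(String monitorClassName, String checkerName, String threadName) {
        if (monitorClassName == null) {
            return "Blocked in handler on " + checkerName + " (" + threadName + ")";
        }
        return "Blocked in monitor " + monitorClassName
                + " on " + checkerName + " (" + threadName + ")";
    }
}
```

For example, describe(null, "main thread", "main") yields "Blocked in handler on main thread (main)", while passing a class name such as "com.example.FooService" yields the "Blocked in monitor ..." form. When reading a real log, the first form points at the handler thread itself and the second at the named service.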

2.4.3 run

  public void run() {
            // Once we get here, we ensure that mMonitors does not change even if we call
            // #addMonitorLocked because we first add the new monitors to mMonitorQueue and
            // move them to mMonitors on the next schedule when mCompleted is true, at which
            // point we have completed execution of this method.
            final int size = mMonitors.size();
            for (int i = 0 ; i < size ; i++) {
                synchronized (Watchdog.this) { // synchronize with the Watchdog thread
                    mCurrentMonitor = mMonitors.get(i);
                }
                // Call the registered service's own check, e.g. Binder's blockUntilThreadAvailable
                // described earlier. If the service is stuck, nothing after this call runs.
                mCurrentMonitor.monitor();
            }

            synchronized (Watchdog.this) {
                mCompleted = true;
                // Reset to null so that describeBlockedStateLocked() can tell whether the handler
                // thread or a monitored service is the one that is stuck.
                mCurrentMonitor = null;
            }
        }

2.4.4 evaluateCheckerCompletionLocked

  private int evaluateCheckerCompletionLocked() {
        int state = COMPLETED;
        for (int i=0; i<mHandlerCheckers.size(); i++) {
            HandlerChecker hc = mHandlerCheckers.get(i);
            state = Math.max(state, hc.getCompletionStateLocked());
        }
        return state;
    }

    private ArrayList<HandlerChecker> getBlockedCheckersLocked() {
        ArrayList<HandlerChecker> checkers = new ArrayList<HandlerChecker>();
        for (int i=0; i<mHandlerCheckers.size(); i++) {
            HandlerChecker hc = mHandlerCheckers.get(i);
            if (hc.isOverdueLocked()) {
                checkers.add(hc);
            }
        }
        return checkers;
    }
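Because the state constants are ordered by severity, taking the maximum over all checkers gives the worst state. A minimal sketch of that aggregation follows, assuming the constant values 0 to 3 listed earlier; CheckerStates is a hypothetical name.

```java
// Hypothetical helper showing the "worst state wins" aggregation.
final class CheckerStates {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    // The overall state is the largest state among all checkers.
    static int evaluate(int[] states) {
        int state = COMPLETED;
        for (int s : states) {
            state = Math.max(state, s);
        }
        return state;
    }
}
```

A single overdue checker is therefore enough to make the whole evaluation OVERDUE, even if every other checker has completed.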

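That covers all of the source code behind Watchdog.

To summarize:

1. Watchdog calls scheduleCheckLocked() on every HandlerChecker, which records when the check starts: mStartTime = SystemClock.uptimeMillis(). This start time matters because it is the baseline for deciding whether a Watchdog timeout has occurred. mHandler.postAtFrontOfQueue(this) then places the checker itself at the front of the handler's message queue to wait for execution. The checker may already time out while waiting to be executed, which is why describeBlockedStateLocked() distinguishes the two cases in its message. Finally mCurrentMonitor.monitor() runs to check whether the registered services are stuck. This is the core logic of Watchdog.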

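2. evaluateCheckerCompletionLocked collects the result of HandlerChecker.getCompletionStateLocked for every checker. The state depends on the elapsed time SystemClock.uptimeMillis() - mStartTime, compared with mWaitMax/2 (30s) and mWaitMax (60s):

COMPLETED = 0: the check has completed;
WAITING = 1: waited less than half of DEFAULT_TIMEOUT, i.e. less than 30s;
WAITED_HALF = 2: waited between 30s and 60s;
OVERDUE = 3: waited 60s or longer.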

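If waitState == WAITED_HALF, the stack traces are dumped once: ActivityManagerService.dumpStackTraces(pids, null, null, getInterestingNativePids(), null);

If waitState == OVERDUE, Watchdog calls EventLog.writeEvent(EventLogTags.WATCHDOG, subject) and ActivityManagerService.dumpStackTraces again. It also makes the kernel write related information to the kernel log with doSysRq('w') and doSysRq('l'), and records the error with mActivity.addErrorToDropBox.

3. If the allowRestart flag is set and no debugger is attached, Watchdog kills the system_server process with Process.killProcess(Process.myPid()); System.exit(10); and the process exits so that the system restarts.

That concludes the explanation of how Watchdog works.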
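The thresholds from point 2 of the summary can be written as a small pure function. This is a sketch of the classification as the article describes it, not a copy of the AOSP method; CompletionState and classify are hypothetical names, and the latency is passed in rather than read from a clock.

```java
// Hypothetical helper that classifies one checker by how long it has waited.
final class CompletionState {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    // latencyMillis is "now - start time"; waitMaxMillis is the checker's timeout.
    static int classify(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / 2) {
            return WAITING;
        }
        if (latencyMillis < waitMaxMillis) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }
}
```

With a 60000 ms timeout, a latency of 29999 ms is WAITING, 30000 ms is WAITED_HALF and 60000 ms is OVERDUE.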

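III. What does the system do after a Watchdog timeout, and which key log messages are printed?

1. Once the wait exceeds 30s, stack traces are dumped through ActivityManagerService.dumpStackTraces. The output consists mainly of thread stacks and is written under /data/anr. As createAnrDumpFile below shows, the file name is ANR_FILE_PREFIX followed by a timestamp (older Android releases wrote to /data/anr/traces.txt).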

    private static synchronized File createAnrDumpFile(File tracesDir) throws IOException {
        if (sAnrFileDateFormat == null) {
            sAnrFileDateFormat = new SimpleDateFormat("yyyy-MM-dd-HH-mm-ss-SSS");
        }

        final String formattedDate = sAnrFileDateFormat.format(new Date());
        final File anrFile = new File(tracesDir, ANR_FILE_PREFIX + formattedDate);

        if (anrFile.createNewFile()) {
            FileUtils.setPermissions(anrFile.getAbsolutePath(), 0600, -1, -1); // -rw-------
            return anrFile;
        } else {
            throw new IOException("Unable to create ANR dump file: createNewFile failed");
        }
    }
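To see what file names this pattern produces, the date formatting can be tried outside Android. The sketch below is an assumption-laden illustration: AnrFileNames and nameFor are hypothetical names, the prefix is passed as a parameter instead of using the ANR_FILE_PREFIX constant, and the time zone is fixed explicitly so the result is reproducible.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper that builds a dump file name from a prefix and a timestamp.
final class AnrFileNames {
    static String nameFor(String prefix, Date when, TimeZone zone) {
        // Same pattern as in createAnrDumpFile: date and time down to milliseconds.
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd-HH-mm-ss-SSS", Locale.US);
        format.setTimeZone(zone);
        return prefix + format.format(when);
    }
}
```

For instance, nameFor("anr_", new Date(0L), TimeZone.getTimeZone("UTC")) returns "anr_1970-01-01-00-00-00-000". The "anr_" prefix here is only an example value. Because the timestamp includes milliseconds, the 30s dump and the 60s dump end up in different files.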

2. Once the wait exceeds 60s, ActivityManagerService.dumpStackTraces is called a second time and EventLog.writeEvent records the event. Watchdog also triggers kernel output with doSysRq('w') and doSysRq('l'):

    private void doSysRq(char c) {
        try {
            FileWriter sysrq_trigger = new FileWriter("/proc/sysrq-trigger");
            sysrq_trigger.write(c);
            sysrq_trigger.close();
        } catch (IOException e) {
            Slog.w(TAG, "Failed to write to /proc/sysrq-trigger", e);
        }
    }
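The same one-character write can be sketched with try-with-resources, which closes the writer even if the write fails. This is an illustrative variant rather than the AOSP code: TriggerWriter and writeCommand are hypothetical names, and the target path is a parameter so the sketch can be tried against an ordinary file instead of /proc/sysrq-trigger.

```java
import java.io.FileWriter;
import java.io.IOException;

// Hypothetical helper that writes a single command character to a file.
final class TriggerWriter {
    // Returns true if the character was written, false on any I/O failure.
    static boolean writeCommand(String path, char c) {
        try (FileWriter writer = new FileWriter(path)) {
            writer.write(c);
            return true;
        } catch (IOException e) {
            // The original logs a warning here; the sketch just reports failure.
            return false;
        }
    }
}
```

On a device the path would be /proc/sysrq-trigger, which normally requires elevated privileges to write.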

Watchdog then calls mActivity.addErrorToDropBox and FrameworkStatsLog.write(FrameworkStatsLog.SYSTEM_SERVER_WATCHDOG_OCCURRED, subject), and finally decides whether to kill the process:

   if (debuggerWasConnected >= 2) {
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
            } else if (debuggerWasConnected > 0) {
                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
            } else if (!allowRestart) {
                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
            } else {
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
                Slog.w(TAG, "*** GOODBYE!");
                Process.killProcess(Process.myPid());
                System.exit(10);
            }
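The order of these branches matters: a debugger takes priority over the restart flag. The decision can be isolated in a small pure function, sketched here with hypothetical names (KillDecision, Action, decide) that do not exist in AOSP.

```java
// Hypothetical helper that mirrors the branch order of the kill decision.
final class KillDecision {
    enum Action {
        SKIP_DEBUGGER_CONNECTED,
        SKIP_DEBUGGER_WAS_CONNECTED,
        SKIP_RESTART_NOT_ALLOWED,
        KILL
    }

    static Action decide(int debuggerWasConnected, boolean allowRestart) {
        if (debuggerWasConnected >= 2) {
            return Action.SKIP_DEBUGGER_CONNECTED;     // debugger attached right now
        }
        if (debuggerWasConnected > 0) {
            return Action.SKIP_DEBUGGER_WAS_CONNECTED; // debugger attached recently
        }
        if (!allowRestart) {
            return Action.SKIP_RESTART_NOT_ALLOWED;
        }
        return Action.KILL;
    }
}
```

Only decide(0, true) returns KILL. Any non-zero debuggerWasConnected value suppresses the kill, whatever allowRestart is.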

Whether the system restarts after a Watchdog timeout depends on allowRestart, which is set through:

 public void setAllowRestart(boolean allowRestart) {
        synchronized (this) {
            mAllowRestart = allowRestart;
        }
    }

This method is called from AMS, in hang():

    public void hang(final IBinder who, boolean allowRestart) {
        if (checkCallingPermission(android.Manifest.permission.SET_ACTIVITY_WATCHER)
                != PackageManager.PERMISSION_GRANTED) {
            throw new SecurityException("Requires permission "
                    + android.Manifest.permission.SET_ACTIVITY_WATCHER);
        }

        final IBinder.DeathRecipient death = new DeathRecipient() {
            @Override
            public void binderDied() {
                synchronized (this) {
                    notifyAll();
                }
            }
        };

        try {
            who.linkToDeath(death, 0);
        } catch (RemoteException e) {
            Slog.w(TAG, "hang: given caller IBinder is already dead.");
            return;
        }

        synchronized (this) {
            Watchdog.getInstance().setAllowRestart(allowRestart);
            Slog.i(TAG, "Hanging system process at request of pid " + Binder.getCallingPid());
            synchronized (death) {
                while (who.isBinderAlive()) {
                    try {
                        death.wait();
                    } catch (InterruptedException e) {
                    }
                }
            }
            Watchdog.getInstance().setAllowRestart(true);
        }
    }

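binderDied is triggered when the process that owns the who binder dies. hang() then stops waiting and sets allowRestart back to true.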
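The wait-until-death pattern used by hang() is ordinary Java monitor code. The sketch below shows it without Binder, under stated assumptions: DeathWaiter, markDead and awaitDeath are hypothetical names, and a boolean flag stands in for who.isBinderAlive().

```java
// Hypothetical model of "block until the remote side has died".
final class DeathWaiter {
    private boolean alive = true;

    // Plays the role of binderDied(): flips the flag and wakes up the waiter.
    synchronized void markDead() {
        alive = false;
        notifyAll();
    }

    // Blocks until markDead() has been called; returns false if interrupted.
    synchronized boolean awaitDeath() {
        while (alive) { // loop guards against spurious wakeups
            try {
                wait();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

A thread calling awaitDeath() stays blocked until another thread calls markDead(). In hang() this blocking happens while holding the AMS lock, which is what makes the system process hang on purpose.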