Reading source code with specific questions in mind is the most effective approach.
1. Overview
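1.1 Questions
This article studies Watchdog around the following questions:
1. How does Watchdog work?
2. What does the system do once Watchdog fires, and what are the key log messages?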
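1.2 What Watchdog does
Android has a hardware watchdog that periodically checks whether key hardware is working. Similarly, the framework layer has a software Watchdog that periodically checks whether key system services have deadlocked. Its main job is to detect whether core system services and important threads are stuck in a Blocked state.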
2. Source code: how Watchdog works
2.1 Starting Watchdog
private void startBootstrapServices(@NonNull TimingsTraceAndSlog t) {
    t.traceBegin("startBootstrapServices");

    // Start the watchdog as early as possible so we can crash the system server
    // if we deadlock during early boot
    t.traceBegin("StartWatchdog");
    final Watchdog watchdog = Watchdog.getInstance();
    watchdog.start();
    t.traceEnd();
    // ... (rest of the method omitted in this excerpt)
}
2.2 getInstance
public static Watchdog getInstance() {
if (sWatchdog == null) {
sWatchdog = new Watchdog();
}
return sWatchdog;
}
2.3 Watchdog
private Watchdog() {
super("watchdog");
// Initialize handler checkers for each common thread we want to check. Note
// that we are not currently checking the background thread, since it can
// potentially hold longer running operations with no guarantees about the timeliness
// of operations there.
// The shared foreground thread is the main checker. It is where we
// will also dispatch monitor checks and do other work.
mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
"foreground thread", DEFAULT_TIMEOUT);
mHandlerCheckers.add(mMonitorChecker);
// Add checker for main thread. We only do a quick check since there
// can be UI running on the thread.
mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
"main thread", DEFAULT_TIMEOUT));
// Add checker for shared UI thread.
mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
"ui thread", DEFAULT_TIMEOUT));
// And also check IO thread.
mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
"i/o thread", DEFAULT_TIMEOUT));
// And the display thread.
mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
"display thread", DEFAULT_TIMEOUT));
// And the animation thread.
mHandlerCheckers.add(new HandlerChecker(AnimationThread.getHandler(),
"animation thread", DEFAULT_TIMEOUT));
// And the surface animation thread.
mHandlerCheckers.add(new HandlerChecker(SurfaceAnimationThread.getHandler(),
"surface animation thread", DEFAULT_TIMEOUT));
// Initialize monitor for Binder threads.
addMonitor(new BinderThreadMonitor()); // registered on mMonitorChecker, so it runs on the foreground thread
mOpenFdMonitor = OpenFdMonitor.create();
mInterestingJavaPids.add(Process.myPid());
// See the notes on DEFAULT_TIMEOUT.
assert DB ||
DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;
}
The constructor creates one HandlerChecker for each of FgThread, the main looper, UiThread, IoThread, DisplayThread, AnimationThread, and SurfaceAnimationThread, and adds each of them to mHandlerCheckers. HandlerChecker implements Runnable.
2.3.1 HandlerChecker
public final class HandlerChecker implements Runnable {
private final Handler mHandler;
private final String mName; // name used for this thread in log output
private final long mWaitMax; // maximum wait time, normally 60s
private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
private final ArrayList<Monitor> mMonitorQueue = new ArrayList<Monitor>();
private boolean mCompleted; // initialized to true
private Monitor mCurrentMonitor;
private long mStartTime; // time at which the current check was scheduled
private int mPauseCount;
HandlerChecker(Handler handler, String name, long waitMaxMillis) {
mHandler = handler;
mName = name;
mWaitMax = waitMaxMillis;
mCompleted = true;
}
void addMonitorLocked(Monitor monitor) {
// We don't want to update mMonitors when the Handler is in the middle of checking
// all monitors. We will update mMonitors on the next schedule if it is safe
mMonitorQueue.add(monitor); // queued here; moved into mMonitors on the next schedule
}
private static final class BinderThreadMonitor implements Watchdog.Monitor {
public void monitor() {
Binder.blockUntilThreadAvailable();
}
}
2.3.2 IPCThreadState.cpp
void IPCThreadState::blockUntilThreadAvailable()
{
pthread_mutex_lock(&mProcess->mThreadCountLock);
while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {
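// Wait until the number of executing binder threads drops below the process limit
// (mMaxThreads, often quoted as 16; the value is set per process)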
pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);
}
pthread_mutex_unlock(&mProcess->mThreadCountLock);
}
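How Watchdog works is covered later, but a short preview helps here. HandlerChecker.run() calls mCurrentMonitor.monitor(), where monitor() is the check method implemented by each service (AMS, PMS, and so on); for binder it is a call to blockUntilThreadAvailable. If that call stalls, for example because every binder thread is busy and the number of executing threads has reached the process limit, it has to wait for a binder thread to be released and may not return in time. Watchdog then concludes that the system is hung.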
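To make the monitor idea concrete, here is a minimal plain-Java sketch; it is not AOSP code. DemoMonitor, DemoService, and MonitorDemo are hypothetical names. It assumes the common pattern in which a service's monitor() does nothing except acquire and release the service's main lock, so the call returns at once when the lock is free and blocks for as long as another thread holds it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for Watchdog.Monitor.
interface DemoMonitor {
    void monitor();
}

// Hypothetical service whose state is guarded by its own intrinsic lock.
final class DemoService implements DemoMonitor {
    // Returns as soon as the service lock can be acquired; blocks otherwise.
    @Override
    public void monitor() {
        synchronized (this) {
            // Nothing to do: being able to take the lock is the health check.
        }
    }
}

final class MonitorDemo {
    // Runs service.monitor() on a helper thread and reports whether it
    // returned within timeoutMs. Returns false if interrupted while waiting.
    static boolean monitorReturnsWithin(DemoService service, long timeoutMs) {
        CountDownLatch done = new CountDownLatch(1);
        Thread helper = new Thread(() -> {
            service.monitor();
            done.countDown();
        });
        helper.setDaemon(true);
        helper.start();
        try {
            return done.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

If a caller holds the DemoService lock while calling monitorReturnsWithin, the helper thread cannot enter monitor() and the method reports a timeout, which is the situation Watchdog is built to detect.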
2.4 run
@Override
public void run() {
boolean waitedHalf = false;
while (true) {
final List<HandlerChecker> blockedCheckers;
final String subject;
final boolean allowRestart;
int debuggerWasConnected = 0;
synchronized (this) {
long timeout = CHECK_INTERVAL;
// Make sure we (re)spin the checkers that have become idle within
// this wait-and-check interval
// Iterate over all registered HandlerCheckers and call scheduleCheckLocked() on each;
// this is the core of Watchdog, see 2.4.1
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i);
hc.scheduleCheckLocked();
}
if (debuggerWasConnected > 0) {
debuggerWasConnected--;
}
// NOTE: We use uptimeMillis() here because we do not want to increment the time we
// wait while asleep. If the device is asleep then the thing that we are waiting
// to timeout on is asleep as well and won't have a chance to run, causing a false
// positive on when to kill things.
long start = SystemClock.uptimeMillis();
while (timeout > 0) { // keep waiting until the full CHECK_INTERVAL (30s) has elapsed, even if wait() returns early
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
try {
wait(timeout);
// Note: mHandlerCheckers and mMonitorChecker may have changed after waiting
} catch (InterruptedException e) {
Log.wtf(TAG, e);
}
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
}
boolean fdLimitTriggered = false;
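if (mOpenFdMonitor != null) { // check for a file-descriptor leak
// monitor() checks whether a high-numbered fd such as /proc/self/fd/1012 exists
fdLimitTriggered = mOpenFdMonitor.monitor();
}
if (!fdLimitTriggered) { // normally false; true only when an fd leak is detected
// The state is derived from (now - mStartTime), compared against mWaitMax / 2
// and mWaitMax, i.e. against 30s and 60s:
// COMPLETED = 0: the check has finished;
// WAITING = 1: waited less than half of DEFAULT_TIMEOUT, i.e. under 30s;
// WAITED_HALF = 2: waited between 30s and 60s;
// OVERDUE = 3: waited 60s or longer.
final int waitState = evaluateCheckerCompletionLocked(); // see 2.4.4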
if (waitState == COMPLETED) {
// The monitors have returned; reset
waitedHalf = false;
continue;
} else if (waitState == WAITING) {
// still waiting but within their configured intervals; back off and recheck
continue;
} else if (waitState == WAITED_HALF) {
if (!waitedHalf) {
Slog.i(TAG, "WAITED_HALF");
// We've waited half the deadlock-detection interval. Pull a stack
// trace and wait another half.
// The first time the wait passes 30s, dump stack traces.
ArrayList<Integer> pids = new ArrayList<>(mInterestingJavaPids);
ActivityManagerService.dumpStackTraces(pids, null, null,
getInterestingNativePids(), null);
waitedHalf = true;
}
continue;
}
// something is overdue!
// Collect the blocked checkers, i.e. those that have been waiting longer than mWaitMax (one minute)
blockedCheckers = getBlockedCheckersLocked();
subject = describeCheckersLocked(blockedCheckers); // joins describeBlockedStateLocked() of every blocked checker
} else {
blockedCheckers = Collections.emptyList();
subject = "Open FD high water mark reached";
}
allowRestart = mAllowRestart; // if false, system_server is not killed
}
// If we got here, that means that the system is most likely hung.
// First collect stack traces from all threads of the system process.
// Then kill this process so that the system will restart.
EventLog.writeEvent(EventLogTags.WATCHDOG, subject); // so every watchdog hit leaves an entry in the event log
ArrayList<Integer> pids = new ArrayList<>(mInterestingJavaPids);
long anrTime = SystemClock.uptimeMillis();
StringBuilder report = new StringBuilder();
report.append(MemoryPressureUtil.currentPsiState()); // append the contents of /proc/pressure/memory
ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(false);
StringWriter tracesFileException = new StringWriter();
// Second dump: stacks of system_server and of the interesting native processes
final File stack = ActivityManagerService.dumpStackTraces(
pids, processCpuTracker, new SparseArray<>(), getInterestingNativePids(),
tracesFileException);
// Give some extra time to make sure the stack traces get written.
// The system's been hanging for a minute, another second or two won't hurt much.
SystemClock.sleep(5000); // already hung for a minute, so 5 more seconds to flush the stack traces is acceptable
processCpuTracker.update();
report.append(processCpuTracker.printCurrentState(anrTime));
report.append(tracesFileException.getBuffer());
// Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
// Writing to /proc/sysrq-trigger makes the kernel print this information
doSysRq('w');
doSysRq('l');
// Try to add the error to the dropbox, but assuming that the ActivityManager
// itself may be deadlocked. (which has happened, causing this statement to
// deadlock and the watchdog as a whole to be ineffective)
Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
public void run() {
// If a watched thread hangs before init() is called, we don't have a
// valid mActivity. So we can't log the error to dropbox.
if (mActivity != null) {
mActivity.addErrorToDropBox(
"watchdog", null, "system_server", null, null, null,
subject, report.toString(), stack, null);
}
FrameworkStatsLog.write(FrameworkStatsLog.SYSTEM_SERVER_WATCHDOG_OCCURRED,
subject);
}
};
dropboxThread.start();
try {
dropboxThread.join(2000); // wait up to 2 seconds for it to return.
} catch (InterruptedException ignored) {}
IActivityController controller;
synchronized (this) {
controller = mController;
}
if (controller != null) {
Slog.i(TAG, "Reporting stuck state to activity controller");
try {
// Report the stuck state to the activity controller
Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
// 1 = keep waiting, -1 = kill system
int res = controller.systemNotResponding(subject);
// A return value of 1 means keep waiting, -1 means kill the system
if (res >= 0) {
Slog.i(TAG, "Activity controller requested to coninue to wait");
waitedHalf = false;
continue;
}
} catch (RemoteException e) {
}
}
// Only kill the process if the debugger is not attached.
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
if (debuggerWasConnected >= 2) {
Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
} else if (debuggerWasConnected > 0) {
Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
} else if (!allowRestart) {
Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
} else {
Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
Slog.w(TAG, "*** GOODBYE!");
Process.killProcess(Process.myPid());
System.exit(10);
}
waitedHalf = false;
}
}
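The inner while (timeout > 0) loop in run() exists because wait() can return before the timeout, for instance after a notify or a spurious wake-up. A minimal sketch of that pattern in plain Java follows. FullIntervalWait and waitFullInterval are hypothetical names, and System.nanoTime() stands in for SystemClock.uptimeMillis(), which is only available on Android.

```java
final class FullIntervalWait {
    // Waits on lock until at least intervalMs has elapsed, re-entering wait()
    // after every early wake-up. Returns how many times wait() returned.
    static int waitFullInterval(Object lock, long intervalMs) {
        int wakeups = 0;
        boolean interrupted = false;
        long start = System.nanoTime();
        long remaining = intervalMs;
        synchronized (lock) {
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    // Remember the interrupt but keep waiting out the interval.
                    interrupted = true;
                }
                wakeups++;
                // Recompute from the clock instead of trusting wait().
                remaining = intervalMs - (System.nanoTime() - start) / 1_000_000L;
            }
        }
        if (interrupted) {
            Thread.currentThread().interrupt(); // restore the flag for the caller
        }
        return wakeups;
    }
}
```

Recomputing the remaining time from the clock after every wake-up is what guarantees the full interval; a single wait(timeout) call would not.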
2.4.1 scheduleCheckLocked
public void scheduleCheckLocked() {
// When no check is in flight (mCompleted is true), move the monitors queued in
// mMonitorQueue into mMonitors and clear the queue.
if (mCompleted) {
// Safe to update monitors in queue, Handler is not in the middle of work
mMonitors.addAll(mMonitorQueue);
mMonitorQueue.clear();
}
// Skip scheduling if there are no monitors and the target looper is polling (idle),
// or if this checker is paused
if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
|| (mPauseCount > 0)) {
// Don't schedule until after resume OR
// If the target looper has recently been polling, then
// there is no reason to enqueue our checker on it since that
// is as good as it not being deadlocked. This avoid having
// to do a context switch to check the thread. Note that we
// only do this if we have no monitors since those would need to
// be executed at this point.
mCompleted = true;
return;
}
if (!mCompleted) { // a check is already in flight; do not reset the start time or post again
// we already have a check in flight, so no need
return;
}
mCompleted = false;
mCurrentMonitor = null;
mStartTime = SystemClock.uptimeMillis(); // record when this check started
// Post this checker to the front of the target handler's message queue. mHandler is one of
// FgThread.getHandler(), UiThread.getHandler(), IoThread.getHandler(), and so on, so a slow task on
// any of these handlers, or a stuck thread, can trigger a software watchdog timeout (SWT).
// describeBlockedStateLocked uses mCurrentMonitor to tell the two cases apart (see HandlerChecker.run()):
// if mCurrentMonitor is non-null at timeout, run() was reached and a service monitor (AMS, PMS, ...)
// is stuck; if it is null, the handler thread never got as far as executing run().
mHandler.postAtFrontOfQueue(this);
}
2.4.2 describeBlockedStateLocked
String describeBlockedStateLocked() {
if (mCurrentMonitor == null) {
return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
} else {
return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
+ " on " + mName + " (" + getThread().getName() + ")";
}
}
2.4.3 run
public void run() {
// Once we get here, we ensure that mMonitors does not change even if we call
// #addMonitorLocked because we first add the new monitors to mMonitorQueue and
// move them to mMonitors on the next schedule when mCompleted is true, at which
// point we have completed execution of this method.
final int size = mMonitors.size();
for (int i = 0 ; i < size ; i++) {
synchronized (Watchdog.this) { // guards mCurrentMonitor, which the watchdog thread also reads
mCurrentMonitor = mMonitors.get(i);
}
mCurrentMonitor.monitor(); // calls the service's own check, e.g. the binder blockUntilThreadAvailable shown earlier; if it blocks, nothing below runs
}
synchronized (Watchdog.this) {
mCompleted = true;
mCurrentMonitor = null; // cleared so describeBlockedStateLocked can tell a stuck handler thread from a stuck service monitor
}
}
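The interplay of scheduleCheckLocked, run, and describeBlockedStateLocked can be sketched as a small single-threaded model. This is an illustration under stated assumptions, not the AOSP implementation: MiniChecker and StuckException are hypothetical names, the "handler thread" is a plain queue that the caller drains by hand, and a hung monitor is modelled by throwing StuckException so that run() never reaches its completion code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

final class MiniChecker implements Runnable {
    // Thrown by a demo monitor to model "monitor() never returns".
    static final class StuckException extends RuntimeException {}

    private final String name;
    private final Queue<Runnable> handlerQueue;
    private final Map<String, Runnable> monitors = new LinkedHashMap<>();
    private boolean completed = true;
    private String currentMonitor; // null until run() reaches a monitor

    MiniChecker(String name, Queue<Runnable> handlerQueue) {
        this.name = name;
        this.handlerQueue = handlerQueue;
    }

    void addMonitor(String monitorName, Runnable body) {
        monitors.put(monitorName, body);
    }

    // Models scheduleCheckLocked(): start a check unless one is in flight.
    void schedule() {
        if (!completed) {
            return;
        }
        completed = false;
        currentMonitor = null;
        handlerQueue.add(this); // stands in for mHandler.postAtFrontOfQueue(this)
    }

    // Models HandlerChecker.run(): executed by the "handler thread".
    @Override
    public void run() {
        for (Map.Entry<String, Runnable> entry : monitors.entrySet()) {
            currentMonitor = entry.getKey();
            entry.getValue().run();
        }
        completed = true;
        currentMonitor = null;
    }

    // Models describeBlockedStateLocked().
    String describe() {
        if (completed) {
            return name + " is healthy";
        }
        return currentMonitor == null
                ? "Blocked in handler on " + name
                : "Blocked in monitor " + currentMonitor + " on " + name;
    }
}
```

A check that is scheduled but never dequeued reports "Blocked in handler", while a check that started and got stuck inside a monitor reports "Blocked in monitor" together with the monitor's name.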
2.4.4 evaluateCheckerCompletionLocked
private int evaluateCheckerCompletionLocked() {
int state = COMPLETED;
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i);
state = Math.max(state, hc.getCompletionStateLocked());
}
return state;
}
private ArrayList<HandlerChecker> getBlockedCheckersLocked() {
ArrayList<HandlerChecker> checkers = new ArrayList<HandlerChecker>();
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i);
if (hc.isOverdueLocked()) {
checkers.add(hc);
}
}
return checkers;
}
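evaluateCheckerCompletionLocked relies on each checker's getCompletionStateLocked(), which is not listed here. A minimal standalone sketch of the classification it performs, based on the state descriptions in this article, might look as follows. CheckerState, classify, and worst are hypothetical names, and the time values are passed in explicitly instead of being read from SystemClock.

```java
final class CheckerState {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    // Classifies one checker from its completion flag and elapsed wait time.
    static int classify(boolean completed, long nowMs, long startMs, long waitMaxMs) {
        if (completed) {
            return COMPLETED;
        }
        long latency = nowMs - startMs;
        if (latency < waitMaxMs / 2) {
            return WAITING;      // under half the timeout
        }
        if (latency < waitMaxMs) {
            return WAITED_HALF;  // between half and the full timeout
        }
        return OVERDUE;          // the full timeout or longer
    }

    // The overall state is the worst (numerically largest) individual state.
    static int worst(int... states) {
        int state = COMPLETED;
        for (int s : states) {
            state = Math.max(state, s);
        }
        return state;
    }
}
```

Because the constants are ordered by severity, taking the maximum is enough: a single overdue checker makes the whole system overdue.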
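That completes the walkthrough of how Watchdog works.
To summarize:
1. Watchdog.run() calls HandlerChecker.scheduleCheckLocked() on every checker. Each call records the start time, mStartTime = SystemClock.uptimeMillis(); this timestamp matters because every timeout decision is measured from it. mHandler.postAtFrontOfQueue(this) then puts the checker itself on the target handler's message queue. The checker may already time out while it waits to be executed, which is why describeBlockedStateLocked distinguishes "Blocked in handler" from "Blocked in monitor". Once the checker runs, it calls mCurrentMonitor.monitor() for each registered monitor to check whether that service is stuck. This is the core logic of Watchdog.
2. evaluateCheckerCompletionLocked takes the worst state over all HandlerChecker.getCompletionStateLocked() results. Each state is derived from SystemClock.uptimeMillis() - mStartTime, compared against mWaitMax / 2 (30s) and mWaitMax (60s):
COMPLETED = 0: the check has finished;
WAITING = 1: waited less than half of DEFAULT_TIMEOUT, i.e. under 30s;
WAITED_HALF = 2: waited between 30s and 60s;
OVERDUE = 3: waited 60s or longer.
If waitState == WAITED_HALF, stack traces are dumped once via ActivityManagerService.dumpStackTraces(pids, null, null, getInterestingNativePids(), null).
If waitState == OVERDUE, Watchdog calls EventLog.writeEvent(EventLogTags.WATCHDOG, subject) and ActivityManagerService.dumpStackTraces again, asks the kernel to log blocked tasks and CPU backtraces through doSysRq('w') and doSysRq('l'), and calls mActivity.addErrorToDropBox.
3. If allowRestart is true and no debugger is (or recently was) attached, Watchdog kills the system_server process with Process.killProcess(Process.myPid()) followed by System.exit(10).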
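3. What does the system do once Watchdog fires? What are the key log messages?
1. Once the wait passes 30s, ActivityManagerService.dumpStackTraces writes stack traces. As createAnrDumpFile shows, each dump goes to a new timestamped file in the traces directory (typically /data/anr) rather than to a single traces.txt: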
private static synchronized File createAnrDumpFile(File tracesDir) throws IOException {
    if (sAnrFileDateFormat == null) {
        sAnrFileDateFormat = new SimpleDateFormat("yyyy-MM-dd-HH-mm-ss-SSS");
    }

    final String formattedDate = sAnrFileDateFormat.format(new Date());
    final File anrFile = new File(tracesDir, ANR_FILE_PREFIX + formattedDate);

    if (anrFile.createNewFile()) {
        FileUtils.setPermissions(anrFile.getAbsolutePath(), 0600, -1, -1); // -rw-------
        return anrFile;
    } else {
        throw new IOException("Unable to create ANR dump file: createNewFile failed");
    }
}
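To show what file names this produces, here is a small standalone sketch. AnrFileNameDemo and anrFileName are hypothetical names, and the "anr_" value of ANR_FILE_PREFIX is an assumption, since the constant's definition is not part of the excerpt. The time zone is passed in explicitly so that the result is reproducible, whereas the original code uses the device default.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

final class AnrFileNameDemo {
    // Assumption: mirrors ANR_FILE_PREFIX in ActivityManagerService.
    static final String ANR_FILE_PREFIX = "anr_";

    // Builds the dump file name for the given instant in the given time zone.
    static String anrFileName(Date when, TimeZone zone) {
        SimpleDateFormat format =
                new SimpleDateFormat("yyyy-MM-dd-HH-mm-ss-SSS", Locale.US);
        format.setTimeZone(zone);
        return ANR_FILE_PREFIX + format.format(when);
    }
}
```

Under these assumptions, the Unix epoch in UTC maps to anr_1970-01-01-00-00-00-000, so files in the traces directory sort chronologically by name.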
2. Once the wait passes 60s, ActivityManagerService.dumpStackTraces runs again, together with EventLog.writeEvent. Watchdog also triggers kernel log output through doSysRq('w') and doSysRq('l'):
private void doSysRq(char c) {
    try {
        FileWriter sysrq_trigger = new FileWriter("/proc/sysrq-trigger");
        sysrq_trigger.write(c);
        sysrq_trigger.close();
    } catch (IOException e) {
        Slog.w(TAG, "Failed to write to /proc/sysrq-trigger", e);
    }
}
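The same single-character write can be sketched with try-with-resources, which closes the writer even when write() throws. This is an illustrative variant rather than the AOSP code: SysRqDemo and writeSysRq are hypothetical names, and the trigger path is a parameter so that the method can be exercised against an ordinary file instead of /proc/sysrq-trigger.

```java
import java.io.FileWriter;
import java.io.IOException;

final class SysRqDemo {
    // Writes one command character to the given trigger file.
    // Returns true on success and false if the file could not be written.
    static boolean writeSysRq(String triggerPath, char command) {
        try (FileWriter trigger = new FileWriter(triggerPath)) {
            trigger.write(command);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

On a real device the path would be /proc/sysrq-trigger and the caller needs permission to write to it; 'w' and 'l' are the two commands used by Watchdog.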
It then calls mActivity.addErrorToDropBox and FrameworkStatsLog.write(FrameworkStatsLog.SYSTEM_SERVER_WATCHDOG_OCCURRED, subject), and finally decides whether to kill the process:
if (debuggerWasConnected >= 2) {
Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
} else if (debuggerWasConnected > 0) {
Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
} else if (!allowRestart) {
Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
} else {
Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
Slog.w(TAG, "*** GOODBYE!");
Process.killProcess(Process.myPid());
System.exit(10);
}
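The branch order above matters: the debugger checks come before the allowRestart check, so an attached debugger always wins. A compact sketch of that decision as a pure function follows; KillDecision, Outcome, and decide are hypothetical names introduced only for illustration.

```java
final class KillDecision {
    enum Outcome {
        DEBUGGER_CONNECTED,      // a debugger is attached right now
        DEBUGGER_WAS_CONNECTED,  // a debugger was attached during a recent pass
        RESTART_NOT_ALLOWED,     // allowRestart is false
        KILL                     // kill system_server
    }

    // Mirrors the if/else chain at the end of Watchdog.run().
    static Outcome decide(int debuggerWasConnected, boolean allowRestart) {
        if (debuggerWasConnected >= 2) {
            return Outcome.DEBUGGER_CONNECTED;
        }
        if (debuggerWasConnected > 0) {
            return Outcome.DEBUGGER_WAS_CONNECTED;
        }
        if (!allowRestart) {
            return Outcome.RESTART_NOT_ALLOWED;
        }
        return Outcome.KILL;
    }
}
```

Only the last outcome leads to Process.killProcess and System.exit(10); every other outcome just logs a warning and lets the loop continue.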
Whether Watchdog restarts the system depends on allowRestart, which is set through:
public void setAllowRestart(boolean allowRestart) {
synchronized (this) {
mAllowRestart = allowRestart;
}
}
This setter is called from AMS, in hang():
public void hang(final IBinder who, boolean allowRestart) {
    if (checkCallingPermission(android.Manifest.permission.SET_ACTIVITY_WATCHER)
            != PackageManager.PERMISSION_GRANTED) {
        throw new SecurityException("Requires permission "
                + android.Manifest.permission.SET_ACTIVITY_WATCHER);
    }

    final IBinder.DeathRecipient death = new DeathRecipient() {
        @Override
        public void binderDied() {
            synchronized (this) {
                notifyAll();
            }
        }
    };

    try {
        who.linkToDeath(death, 0);
    } catch (RemoteException e) {
        Slog.w(TAG, "hang: given caller IBinder is already dead.");
        return;
    }

    synchronized (this) {
        Watchdog.getInstance().setAllowRestart(allowRestart);
        Slog.i(TAG, "Hanging system process at request of pid " + Binder.getCallingPid());
        synchronized (death) {
            while (who.isBinderAlive()) {
                try {
                    death.wait();
                } catch (InterruptedException e) {
                }
            }
        }
        Watchdog.getInstance().setAllowRestart(true);
    }
}
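When the process that owns the who binder dies, binderDied() wakes the waiting thread and allowRestart is set back to true.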