目录
一 watchdog的创建
/frameworks/base/services/java/com/android/server/SystemServer.java
private void startBootstrapServices(@NonNull TimingsTraceAndSlog t) {
t.traceBegin("startBootstrapServices");
// Start the watchdog as early as possible so we can crash the system server
// if we deadlock during early boot
t.traceBegin("StartWatchdog");
final Watchdog watchdog = Watchdog.getInstance();
watchdog.start();
t.traceEnd();
...
...
}
systemserver 进程在启动是会创建watchdog 实例,并启动这个线程。
public static Watchdog getInstance() {
if (sWatchdog == null) {
sWatchdog = new Watchdog();
}
return sWatchdog;
}
watchdog的实例只会有一个。
二 watchdog的成员和初始化
private final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();
private final HandlerChecker mMonitorChecker;
这里列举两个比较重要的成员变量,一个是HandlerChecker 类的list,一个是HandlerChecker对象。
2.1 理解HandlerChecker
HandlerChecker 是WatchDog的内部类,实现了Runnable 接口,这个类又维护了Monitor 类的list
private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
private final ArrayList<Monitor> mMonitorQueue = new ArrayList<Monitor>();
private boolean mCompleted;
Monitor 是一个接口,这个接口只有一个方法,如果想要实现这个接口,必须要实现monitor()方法。
2.2 mMonitorChecker
这里单独维护了一个HandlerChecker 对象,实际上是为了fg.thread 线程所用。
2.3 mCompleted
用来标记被监测的对象是否已经完成了monitor 方法。
2.3 watchdog的初始化
private Watchdog() {
super("watchdog");
// Initialize handler checkers for each common thread we want to check. Note
// that we are not currently checking the background thread, since it can
// potentially hold longer running operations with no guarantees about the timeliness
// of operations there.
// The shared foreground thread is the main checker. It is where we
// will also dispatch monitor checks and do other work.
mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
"foreground thread", DEFAULT_TIMEOUT);
mHandlerCheckers.add(mMonitorChecker);
// Add checker for main thread. We only do a quick check since there
// can be UI running on the thread.
mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
"main thread", DEFAULT_TIMEOUT));
// Add checker for shared UI thread.
mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
"ui thread", DEFAULT_TIMEOUT));
// And also check IO thread.
mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
"i/o thread", DEFAULT_TIMEOUT));
// And the display thread.
mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
"display thread", DEFAULT_TIMEOUT));
// And the animation thread.
mHandlerCheckers.add(new HandlerChecker(AnimationThread.getHandler(),
"animation thread", DEFAULT_TIMEOUT));
// And the surface animation thread.
mHandlerCheckers.add(new HandlerChecker(SurfaceAnimationThread.getHandler(),
"surface animation thread", DEFAULT_TIMEOUT));
// Initialize monitor for Binder threads.
addMonitor(new BinderThreadMonitor());
mOpenFdMonitor = OpenFdMonitor.create();
mInterestingJavaPids.add(Process.myPid());
// See the notes on DEFAULT_TIMEOUT.
assert DB ||
DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;
}
- 将mMonitorChecker 成员变量初始化,并加入到HandlerChecker 类的list
- 将需要监测的线程加入到HandlerChecker 类的list
2.4 watchdog的run 方法
- 整个run 方法其实一个死循环
- 开始让每一个HandlerChecker 都开始执行scheduleCheckLocked,看下这个函数
public void scheduleCheckLocked() {
if (mCompleted) {
// Safe to update monitors in queue, Handler is not in the middle of work
mMonitors.addAll(mMonitorQueue);
mMonitorQueue.clear();
}
if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
|| (mPauseCount > 0)) {
// Don't schedule until after resume OR
// If the target looper has recently been polling, then
// there is no reason to enqueue our checker on it since that
// is as good as it not being deadlocked. This avoid having
// to do a context switch to check the thread. Note that we
// only do this if we have no monitors since those would need to
// be executed at this point.
mCompleted = true;
return;
}
if (!mCompleted) {
// we already have a check in flight, so no need
return;
}
mCompleted = false;
mCurrentMonitor = null;
mStartTime = SystemClock.uptimeMillis();
mHandler.postAtFrontOfQueue(this);
}
设置mCompleted这个变量;将当前的runnable 对象放到messagequeue,之后执行它的run 方法,实际上也就是执行HandlerChecker的run方法
- 再来看HandlerChecker的run方法
@Override
public void run() {
// Once we get here, we ensure that mMonitors does not change even if we call
// #addMonitorLocked because we first add the new monitors to mMonitorQueue and
// move them to mMonitors on the next schedule when mCompleted is true, at which
// point we have completed execution of this method.
final int size = mMonitors.size();
for (int i = 0 ; i < size ; i++) {
synchronized (Watchdog.this) {
mCurrentMonitor = mMonitors.get(i);
}
mCurrentMonitor.monitor();
}
synchronized (Watchdog.this) {
mCompleted = true;
mCurrentMonitor = null;
}
}
执行每一个加入mMonitors的monitor 方法,前面我们叙述过,fg 线程中会有很多monitor对象。
- watchdog 线程wait 30s,30s之后监测每一个HandlerChecker
private int evaluateCheckerCompletionLocked() {
int state = COMPLETED;
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i);
state = Math.max(state, hc.getCompletionStateLocked());
}
return state;
}
public int getCompletionStateLocked() {
if (mCompleted) {
return COMPLETED;
} else {
long latency = SystemClock.uptimeMillis() - mStartTime;
if (latency < mWaitMax/2) {
return WAITING;
} else if (latency < mWaitMax) {
return WAITED_HALF;
}
}
return OVERDUE;
}
这里用时间来判断,小于30s则继续等待,小于60s,说明已经到超时的一半了。
public int getCompletionStateLocked() {
if (mCompleted) {
return COMPLETED;
} else {
long latency = SystemClock.uptimeMillis() - mStartTime;
if (latency < mWaitMax/2) {
return WAITING;
} else if (latency < mWaitMax) {
return WAITED_HALF;
}
}
return OVERDUE;
}
开始通知AMS dump了
watchdog 继续监测30s,这时候如果有超时的HandlerChecker,就把他加入到block HandlerChecker里面
private ArrayList<HandlerChecker> getBlockedCheckersLocked() {
ArrayList<HandlerChecker> checkers = new ArrayList<HandlerChecker>();
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i);
if (hc.isOverdueLocked()) {
checkers.add(hc);
}
}
return checkers;
}
系统的线程处理时间超过60s了,说明系统已经hang住了,按照android的机制就需要重启
- 通知AMS 开始dump
- 用sysrq打印kernel里block的线程
- 输出到dropbox
最后就是重启。
三 总结
- watchdog 会监测一些线程,其中fg 线程会添加monitor对象
- 监测机制会先监测有没有30s,30s后没有完成,会先dump一次
- 60s后会dump一次,之后打sysrq,系统重启。