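Android O added a new feature: when a core system component gets stuck in a crash loop, a "rescue" mechanism kicks in. That mechanism is RescueParty, the subject of this post. RescueParty uses its own logic to escalate the rescue level step by step until it reaches the factory-reset level (LEVEL_FACTORY_RESET), at which point the device boots into Recovery and prompts the user to restore factory settings.
Next, let's walk through the source (based on Android Q) to see how the rescue mechanism is triggered and how the device ends up in Recovery.
Source files involved: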
frameworks/base/services/core/java/com/android/server/RescueParty.java
frameworks/base/services/java/com/android/server/SystemServer.java
frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java
frameworks/base/services/core/java/com/android/server/am/AppErrors.java
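According to the class comment of RescueParty, when crashes happen frequently enough the class escalates step by step and, as a last resort, prompts the user to wipe data. There are five rescue levels (counting LEVEL_NONE), and only LEVEL_FACTORY_RESET makes the device enter Recovery. The current level can be queried through the "sys.rescue_level" property.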
/**
* Utilities to help rescue the system from crash loops. Callers are expected to
* report boot events and persistent app crashes, and if they happen frequently
* enough this class will slowly escalate through several rescue operations
* before finally rebooting and prompting the user if they want to wipe data as
* a last resort.
*/
// Rescue levels
static final int LEVEL_NONE = 0;
static final int LEVEL_RESET_SETTINGS_UNTRUSTED_DEFAULTS = 1;
static final int LEVEL_RESET_SETTINGS_UNTRUSTED_CHANGES = 2;
static final int LEVEL_RESET_SETTINGS_TRUSTED_DEFAULTS = 3;
static final int LEVEL_FACTORY_RESET = 4;
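To make a raw "sys.rescue_level" value easier to read, the mapping from integer to level can be sketched as below. This is a minimal, standalone illustration: the class RescueLevelNames and its methods describe and entersRecovery are hypothetical helpers, not AOSP code, and the returned names are simply the constant names without the LEVEL_ prefix.

```java
// Hypothetical helper (not part of AOSP): interprets a sys.rescue_level value.
final class RescueLevelNames {
    // Same values as the constants quoted from RescueParty above.
    static final int LEVEL_NONE = 0;
    static final int LEVEL_RESET_SETTINGS_UNTRUSTED_DEFAULTS = 1;
    static final int LEVEL_RESET_SETTINGS_UNTRUSTED_CHANGES = 2;
    static final int LEVEL_RESET_SETTINGS_TRUSTED_DEFAULTS = 3;
    static final int LEVEL_FACTORY_RESET = 4;

    // Returns a readable name for a level; unknown values are echoed as digits.
    static String describe(int level) {
        switch (level) {
            case LEVEL_NONE: return "NONE";
            case LEVEL_RESET_SETTINGS_UNTRUSTED_DEFAULTS: return "RESET_SETTINGS_UNTRUSTED_DEFAULTS";
            case LEVEL_RESET_SETTINGS_UNTRUSTED_CHANGES: return "RESET_SETTINGS_UNTRUSTED_CHANGES";
            case LEVEL_RESET_SETTINGS_TRUSTED_DEFAULTS: return "RESET_SETTINGS_TRUSTED_DEFAULTS";
            case LEVEL_FACTORY_RESET: return "FACTORY_RESET";
            default: return Integer.toString(level);
        }
    }

    // Only the last level sends the device to Recovery.
    static boolean entersRecovery(int level) {
        return level == LEVEL_FACTORY_RESET;
    }
}
```

For example, RescueLevelNames.describe(2) yields the name of the second settings-reset level, and entersRecovery is true only for level 4.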
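Next, let's see how the level is raised until it reaches LEVEL_FACTORY_RESET. RescueParty has a method for this, incrementRescueLevel, whose name says it all. The level is stored in a system property: the method reads the current value (0 by default), adds 1, clamps the result so it never exceeds LEVEL_FACTORY_RESET, and writes it back.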
/**
* Escalate to the next rescue level. After incrementing the level you'll
* probably want to call {@link #executeRescueLevel(Context)}.
*/
private static void incrementRescueLevel(int triggerUid) {
final int level = MathUtils.constrain(
SystemProperties.getInt(PROP_RESCUE_LEVEL, LEVEL_NONE) + 1,
LEVEL_NONE, LEVEL_FACTORY_RESET);
SystemProperties.set(PROP_RESCUE_LEVEL, Integer.toString(level));
EventLogTags.writeRescueLevel(level, triggerUid);
logCriticalInfo(Log.WARN, "Incremented rescue level to "
+ levelToString(level) + " triggered by UID " + triggerUid);
}
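The clamping behaviour of incrementRescueLevel can be tried outside Android with a small sketch. It assumes an in-memory field in place of the system property, and a local constrain method that mirrors what MathUtils.constrain does for ints; the class name RescueLevelEscalation is made up for this illustration.

```java
// Standalone sketch (not AOSP code): clamped escalation of the rescue level.
final class RescueLevelEscalation {
    static final int LEVEL_NONE = 0;
    static final int LEVEL_FACTORY_RESET = 4;

    // Assumption: an in-memory stand-in for the sys.rescue_level property.
    private int storedLevel = LEVEL_NONE;

    // Local equivalent of clamping an int into [low, high].
    static int constrain(int amount, int low, int high) {
        return amount < low ? low : (amount > high ? high : amount);
    }

    // Read the stored level, add 1, clamp, write back, and return the new level.
    int increment() {
        storedLevel = constrain(storedLevel + 1, LEVEL_NONE, LEVEL_FACTORY_RESET);
        return storedLevel;
    }

    int current() {
        return storedLevel;
    }
}
```

Calling increment() repeatedly walks the level from 1 up to 4 and then stays at 4, which is why further triggers after the factory-reset level do not push the value any higher.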
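This method is private and has two callers within the class, corresponding to two scenarios:
a persistent system app crashes 5 or more times within 30 seconds, or system_server restarts 5 or more times within 10 minutes.
1. noteAppCrash: records an app crash
2. noteBoot: records a boot-time crash, i.e. a system_server crash
// Scenario 1: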
/**
* Take note of a persistent app or apex module crash. If we notice too many of these
* events happening in rapid succession, we'll send out a rescue party.
*/
public static void noteAppCrash(Context context, int uid) {
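if (isDisabled()) return; // Bail out if the feature is disabled
Threshold t = sApps.get(uid); // Check whether the app with this UID has crashed before
if (t == null) {
t = new AppThreshold(uid);
sApps.put(uid, t); // No record yet, so add one
}
if (t.incrementAndTest()) { // Decide whether the rescue level should be raised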
t.reset();
incrementRescueLevel(t.uid);
executeRescueLevel(context);
}
}
// Scenario 2:
/**
* Take note of a boot event. If we notice too many of these events
* happening in rapid succession, we'll send out a rescue party.
*/
public static void noteBoot(Context context) {
if (isDisabled()) return;
if (sBoot.incrementAndTest()) {
sBoot.reset();
incrementRescueLevel(sBoot.uid);
executeRescueLevel(context);
}
}
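In scenario 1, after AMS records an app crash it calls into AppErrors.crashApplication to handle it; the actual work happens in crashApplicationInner. If the crashed process is persistent or belongs to an APEX module, noteAppCrash is called, which may trigger the rescue mechanism.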
void crashApplicationInner(ProcessRecord r, ApplicationErrorReport.CrashInfo crashInfo,
int callingPid, int callingUid) {
...
if (r.isPersistent() || isApexModule) {
// If a persistent app or apex module is stuck in a crash loop, the device isn't
// very usable, so we want to consider sending out a rescue party.
RescueParty.noteAppCrash(mContext, r.uid);
}
...
}
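Scenario 2 checks for repeated crashes during boot, that is, whether the device keeps restarting. It is triggered while the system_server process starts up, so the mechanism effectively monitors system_server, which can be thought of as a special kind of app.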
private void startBootstrapServices() {
...
RescueParty.noteBoot(mSystemContext);
...
}
This trigger path applies to Q and earlier. In R the logic was moved into the PackageWatchdog class, which is also brought up during system_server startup, so the end result is the same.
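Finally, let's look at how RescueParty actually monitors crashes. As mentioned above, incrementAndTest is called before each escalation to decide whether the level should be raised, so this method is the real criterion for escalation.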
public boolean incrementAndTest() {
final long now = getElapsedRealtime(); // Current elapsed realtime
final long window = now - getStart(); // Time elapsed since the current counting window started
// triggerWindow is BOOT_TRIGGER_WINDOW_MILLIS (600 s) or PERSISTENT_APP_CRASH_TRIGGER_WINDOW_MILLIS (30 s)
if (window > triggerWindow) { // Window expired: start over with count = 1 and the window starting now
setCount(1);
setStart(now);
return false;
} else { // Otherwise the event falls inside the window, so increment the count
int count = getCount() + 1;
setCount(count);
EventLogTags.writeRescueNote(uid, count, window);
Slog.w(TAG, "Noticed " + count + " events for UID " + uid + " in last "
+ (window / 1000) + " sec");
return (count >= triggerCount); // Whether the count has reached the threshold, i.e. 5
}
}
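The windowed counting in incrementAndTest can be exercised outside Android with a minimal sketch. It makes several assumptions: the clock is passed in as a parameter instead of being read via getElapsedRealtime(), count and start live in plain fields rather than wherever the real Threshold subclasses persist them, and the event logging is left out. The class name CrashWindowCounter is made up for this illustration.

```java
// Standalone sketch (not AOSP code): the time-window threshold test.
final class CrashWindowCounter {
    private final int triggerCount;          // e.g. 5 events
    private final long triggerWindowMillis;  // e.g. 30_000 for apps, 600_000 for boot
    private int count;                       // events seen in the current window
    private long start;                      // timestamp at which the window began

    CrashWindowCounter(int triggerCount, long triggerWindowMillis) {
        this.triggerCount = triggerCount;
        this.triggerWindowMillis = triggerWindowMillis;
    }

    // Records one event at nowMillis; returns true once the threshold is reached.
    boolean incrementAndTest(long nowMillis) {
        final long window = nowMillis - start;
        if (window > triggerWindowMillis) {
            // The previous window has expired: this event opens a new one.
            count = 1;
            start = nowMillis;
            return false;
        }
        // Still inside the window: count the event and test the threshold.
        count++;
        return count >= triggerCount;
    }

    // Assumption: reset clears both fields, as the callers above expect after a trigger.
    void reset() {
        count = 0;
        start = 0;
    }

    int count() {
        return count;
    }
}
```

With new CrashWindowCounter(5, 30_000) and events at 100 s, 101 s, 102 s, 103 s and 104 s of uptime, the first four calls return false and the fifth returns true. An event that arrives more than 30 s after the window started simply opens a fresh window with a count of 1.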
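That is the overall logic of RescueParty, and it is clear and simple. Its purpose is equally straightforward: monitor crashing processes to determine whether the system has hit a serious problem and, as a last resort, prompt the user to wipe data to recover.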