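You may have run into this yourself: the system fails to boot normally, either hanging on the boot animation or rebooting into the Recovery screen shortly after startup. I hit exactly this on the Rainbow project: after flashing, the device booted, then rebooted on its own into Recovery. It turned out that a process kept crashing during boot, which triggered the RescueParty mechanism. This post walks through how that mechanism works.
First, the mechanism is called RescueParty.
Today, when a phone gets stuck in a boot loop, users (even power users) have no way to fix it themselves and have to go through the manufacturer's after-sales service.
Google added this feature in Android 8.0 and calls it Rescue Party.
It watches for crash loops in core system components. When it detects one, it performs a series of actions that escalate through rescue levels to see whether the device can recover. As a last resort it reboots into recovery and prompts the user to wipe user data with a factory reset.
The code below is based on Android 12.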
frameworks/base/services/core/java/com/android/server/RescueParty.java
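Rescue levels:
// Do nothing
static final int LEVEL_NONE = 0;
// Reset settings changed by non-system (untrusted) packages to their defaults
static final int LEVEL_RESET_SETTINGS_UNTRUSTED_DEFAULTS = 1;
// For settings set by non-system packages: reset to the system default if there is one, otherwise delete
static final int LEVEL_RESET_SETTINGS_UNTRUSTED_CHANGES = 2;
// For settings set by any package: reset to the system default if there is one, otherwise delete
static final int LEVEL_RESET_SETTINGS_TRUSTED_DEFAULTS = 3;
// Reboot the device
static final int LEVEL_WARM_REBOOT = 4;
// Attempt a factory reset
static final int LEVEL_FACTORY_RESET = 5;
Each level maps to a branch in the following code: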
private static void executeRescueLevelInternal(Context context, int level,
@Nullable String failedPackage) throws Exception {
switch (level) {
case LEVEL_RESET_SETTINGS_UNTRUSTED_DEFAULTS:
resetAllSettingsIfNecessary(context, Settings.RESET_MODE_UNTRUSTED_DEFAULTS, level);
resetDeviceConfig(context, /*isScoped=*/true, failedPackage);
break;
case LEVEL_RESET_SETTINGS_UNTRUSTED_CHANGES:
resetAllSettingsIfNecessary(context, Settings.RESET_MODE_UNTRUSTED_CHANGES, level);
resetDeviceConfig(context, /*isScoped=*/true, failedPackage);
break;
case LEVEL_RESET_SETTINGS_TRUSTED_DEFAULTS:
resetAllSettingsIfNecessary(context, Settings.RESET_MODE_TRUSTED_DEFAULTS, level);
resetDeviceConfig(context, /*isScoped=*/false, failedPackage);
break;
case LEVEL_WARM_REBOOT:
PowerManager pm = context.getSystemService(PowerManager.class);
pm.reboot(TAG);
break;
case LEVEL_FACTORY_RESET:
RecoverySystem.rebootPromptAndWipeUserData(context, TAG);
break;
}
}
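Trigger conditions:
(1) system_server restarts more than 5 times within 5 minutes: the level is raised once. (On Android 12 the window is 5 restarts within 10 minutes.)
(2) A persistent system app crashes more than 5 times within 30 seconds: the level is raised once. (On Android 12 the default is 5 crashes within 60 seconds.)
When either condition is detected, Rescue Party escalates to the next rescue level, performs the tasks associated with that level, and lets the device continue running to see whether it recovers. The amount of data cleared or reset grows with each level. The highest level prompts the user to factory-reset the device.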
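The escalation described above can be sketched as a mapping from the number of mitigations already attempted to a rescue level. This is a simplified illustration, not the actual `getRescueLevel` implementation: the helper name `levelFor` and the `maxLevel` cap are hypothetical, and the sketch assumes each mitigation moves exactly one level up.

```java
final class RescueLevelSketch {
    static final int LEVEL_NONE = 0;
    static final int LEVEL_FACTORY_RESET = 5;

    /**
     * Hypothetical helper: one level per mitigation attempt, never above
     * LEVEL_FACTORY_RESET and never above the caller-supplied maxLevel.
     */
    static int levelFor(int mitigationCount, int maxLevel) {
        if (mitigationCount <= 0) {
            return LEVEL_NONE;
        }
        int uncapped = Math.min(mitigationCount, LEVEL_FACTORY_RESET);
        return Math.min(uncapped, maxLevel);
    }

    public static void main(String[] args) {
        // Print the level reached after each of seven mitigation attempts.
        for (int count = 0; count <= 7; count++) {
            System.out.println(count + " -> " + levelFor(count, LEVEL_FACTORY_RESET));
        }
    }
}
```

Passing a lower `maxLevel` (for example 4) models a device that is allowed to warm-reboot but not to factory-reset.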
The other levels are straightforward, so let's look at the factory reset path:
public static void rebootPromptAndWipeUserData(Context context, String reason)
throws IOException {
...
// Check whether rolling back the pending update can fix the problem
// If we are running in checkpointing mode, we should not prompt a wipe.
// Checkpointing may save us. If it doesn't, we will wind up here again.
if (checkpointing) {
try {
vold.abortChanges("rescueparty", false);
Log.i(TAG, "Rescue Party requested wipe. Aborting update");
} catch (Exception e) {
Log.i(TAG, "Rescue Party requested wipe. Rebooting instead.");
PowerManager pm = (PowerManager) context.getSystemService(Context.POWER_SERVICE);
pm.reboot("rescueparty");
}
return;
}
// Issue the factory reset command: --prompt_and_wipe_data
bootCommand(context, null, "--prompt_and_wipe_data", reasonArg, localeArg);
}
private static void bootCommand(Context context, String... args) throws IOException {
StringBuilder command = new StringBuilder();
for (String arg : args) {
if (!TextUtils.isEmpty(arg)) {
command.append(arg);
command.append("\n");
}
}
// Write the command into BCB (bootloader control block) and boot from
// there. Will not return unless failed.
RecoverySystem rs = (RecoverySystem) context.getSystemService(Context.RECOVERY_SERVICE);
rs.rebootRecoveryWithCommand(command.toString());
}
public void rebootRecoveryWithCommand(String command) {
// In the end this still goes through PowerManager to trigger the reboot
PowerManager pm = mInjector.getPowerManager();
pm.reboot(PowerManager.REBOOT_RECOVERY);
}
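The command assembly in bootCommand is easy to try in isolation. The sketch below reproduces only the string building, using a plain null/empty check in place of Android's `TextUtils.isEmpty`. The class name, the method name `buildCommand`, and the `--reason`/`--locale` values are illustrative placeholders, not values taken from a real device.

```java
final class BootCommandSketch {
    /** Joins the non-empty arguments, one per line, as bootCommand does. */
    static String buildCommand(String... args) {
        StringBuilder command = new StringBuilder();
        for (String arg : args) {
            // Skip null and empty arguments, mirroring TextUtils.isEmpty(arg).
            if (arg != null && !arg.isEmpty()) {
                command.append(arg);
                command.append("\n");
            }
        }
        return command.toString();
    }

    public static void main(String[] args) {
        // Placeholder reason and locale arguments, for illustration only.
        System.out.print(buildCommand(
                null, "--prompt_and_wipe_data", "--reason=example", "--locale=en-US"));
    }
}
```

Each surviving argument ends up on its own line, which is the form the recovery command string takes before it is handed to rebootRecoveryWithCommand.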
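Rescue Party is disabled in the cases below. All of them apply only while PROP_ENABLE_RESCUE is false; when it is true, Rescue Party stays enabled regardless of the rest:
(1) PROP_DEVICE_CONFIG_DISABLE_FLAG is true
(2) The build is an eng build
(3) The build is a userdebug build and USB is connected to a computer
(4) PROP_DISABLE_RESCUE is true
The controlling logic: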
private static boolean isDisabled() {
// Check if we're explicitly enabled for testing
if (SystemProperties.getBoolean(PROP_ENABLE_RESCUE, false)) {
return false;
}
// We're disabled if the DeviceConfig disable flag is set to true.
// This is in case that an emergency rollback of the feature is needed.
if (SystemProperties.getBoolean(PROP_DEVICE_CONFIG_DISABLE_FLAG, false)) {
Slog.v(TAG, "Disabled because of DeviceConfig flag");
return true;
}
// We're disabled on all engineering devices
if (Build.IS_ENG) {
Slog.v(TAG, "Disabled because of eng build");
return true;
}
// We're disabled on userdebug devices connected over USB, since that's
// a decent signal that someone is actively trying to debug the device,
// or that it's in a lab environment.
if (Build.IS_USERDEBUG && isUsbActive()) {
Slog.v(TAG, "Disabled because of active USB connection");
return true;
}
// One last-ditch check
if (SystemProperties.getBoolean(PROP_DISABLE_RESCUE, false)) {
Slog.v(TAG, "Disabled because of manual property");
return true;
}
return false;
}
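The order of those checks matters, so it can help to see them as a pure function. This is a minimal sketch in which every input is passed in as a boolean; the parameter names are hypothetical stand-ins for the system properties and Build flags that isDisabled reads.

```java
final class RescueDisableSketch {
    /** Same decision order as isDisabled, with all inputs made explicit. */
    static boolean isDisabled(boolean enableRescueProp,
                              boolean deviceConfigDisableFlag,
                              boolean isEngBuild,
                              boolean isUserdebugBuild,
                              boolean usbActive,
                              boolean disableRescueProp) {
        if (enableRescueProp) {
            return false; // explicit enable wins over every other check
        }
        if (deviceConfigDisableFlag) {
            return true;
        }
        if (isEngBuild) {
            return true;
        }
        if (isUserdebugBuild && usbActive) {
            return true;
        }
        return disableRescueProp; // last-ditch manual switch
    }

    public static void main(String[] args) {
        // userdebug build with USB attached and no explicit enable
        System.out.println(isDisabled(false, false, false, true, true, false));
        // same device after setting the enable property
        System.out.println(isDisabled(true, false, false, true, true, false));
    }
}
```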
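Walking through the logic:
1. Early in boot, registerHealthObserver registers RescueParty as a listener for PackageWatchdog failure events. RecoverySystemService is started first, because a rescue that ends in a reboot to recovery will need it.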
private void startBootstrapServices(@NonNull TimingsTraceAndSlog t) {
...
// Bring up recovery system in case a rescue party needs a reboot
t.traceBegin("StartRecoverySystemService");
mSystemServiceManager.startService(RecoverySystemService.Lifecycle.class);
t.traceEnd();
// Now that we have the bare essentials of the OS up and running, take
// note that we just booted, which might send out a rescue party if
// we're stuck in a runtime restart loop.
RescueParty.registerHealthObserver(mSystemContext);
PackageWatchdog.getInstance(mSystemContext).noteBoot();
...
}
/** Register the Rescue Party observer as a Package Watchdog health observer */
public static void registerHealthObserver(Context context) {
PackageWatchdog.getInstance(context).registerHealthObserver(
RescuePartyObserver.getInstance(context));
}
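Note that the inner class RescuePartyObserver implements the PackageWatchdog.PackageHealthObserver interface. When PackageWatchdog detects an app failure, for example, it calls back into execute. Failures here cover both crashes and ANRs. The call then goes executeRescueLevel -> executeRescueLevelInternal, which is the per-level switch shown earlier.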
@Override
public boolean execute(@Nullable VersionedPackage failedPackage,
@FailureReasons int failureReason, int mitigationCount) {
if (isDisabled()) {
return false;
}
if (failureReason == PackageWatchdog.FAILURE_REASON_APP_CRASH
|| failureReason == PackageWatchdog.FAILURE_REASON_APP_NOT_RESPONDING) {
// Work out the current rescue level
final int level = getRescueLevel(mitigationCount);
executeRescueLevel(mContext,
failedPackage == null ? null : failedPackage.getPackageName(), level);
return true;
} else {
return false;
}
}
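When the device has an active USB data connection, the system suppresses all rescue events, because that is a strong signal that someone is debugging the device. To turn off this suppression, run:
adb shell setprop persist.sys.enable_rescue 1
From here you can trigger a system or UI crash loop.
To trigger a low-level system_server crash loop, run:
adb shell setprop debug.crash_system 1
The corresponding logic: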
frameworks/base/services/java/com/android/server/SystemServer.java
// For debugging RescueParty
if (Build.IS_DEBUGGABLE && SystemProperties.getBoolean("debug.crash_system", false)) {
throw new RuntimeException();
}
To trigger a mid-level SystemUI crash loop, run:
adb shell setprop debug.crash_sysui 1
The corresponding logic:
frameworks/base/packages/SystemUI/src/com/android/systemui/SystemUIService.java
// For debugging RescueParty
if (Build.IS_DEBUGGABLE && SystemProperties.getBoolean("debug.crash_sysui", false)) {
throw new RuntimeException();
}
Then restart the framework from the command line to trigger these crashes:
adb root
adb shell stop
adb shell start
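At this point you may be wondering: how does PackageWatchdog know that system_server or an app is in trouble?
1. How system_server failures are detected
PackageWatchdog runs inside system_server, and a fatal system_server error restarts system_server itself, so at the moment of failure no PackageWatchdog code gets a chance to run. So how is it triggered?
Call it the boot-record approach. Recall that PackageWatchdog.noteBoot is called once during boot. Its job is simple: it tells PackageWatchdog that system_server is going through another boot. PackageWatchdog records this through its inner class BootThreshold: sys.rescue_boot_start holds the start time of the current counting window, and sys.rescue_boot_count holds the number of system_server restarts, incremented on every noteBoot. When the count reaches the preset threshold within the preset time window, the boot counter is reset, the mitigation count (which determines the rescue level) is incremented and saved to sys.boot_mitigation_count, and to metadata when needed, and the corresponding RescueParty logic runs.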
public void noteBoot() {
synchronized (mLock) {
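// Increment the boot counter and check whether it has reached the threshold
if (mBootThreshold.incrementAndTest()) {
// Reset the counter
mBootThreshold.reset();
// Bump the mitigation count, which determines the rescue level
int mitigationCount = mBootThreshold.getMitigationCount() + 1;
PackageHealthObserver currentObserverToNotify = null;
int currentObserverImpact = Integer.MAX_VALUE;
// Ask every observer for its impact and keep the one with the least user impact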
for (int i = 0; i < mAllObservers.size(); i++) {
final ObserverInternal observer = mAllObservers.valueAt(i);
PackageHealthObserver registeredObserver = observer.registeredObserver;
if (registeredObserver != null) {
int impact = registeredObserver.onBootLoop(mitigationCount);
if (impact != PackageHealthObserverImpact.USER_IMPACT_NONE
&& impact < currentObserverImpact) {
currentObserverToNotify = registeredObserver;
currentObserverImpact = impact;
}
}
}
if (currentObserverToNotify != null) {
// Persist the current mitigation count
mBootThreshold.setMitigationCount(mitigationCount);
mBootThreshold.saveMitigationCountToMetadata();
// Call back into RescueParty, which runs executeRescueLevel
currentObserverToNotify.executeBootLoopMitigation(mitigationCount);
}
}
}
}
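The observer loop in noteBoot picks the observer whose mitigation would disturb the user least. A minimal sketch of that selection, with observers reduced to a name and a reported impact, could look like this. The class and method names are hypothetical, and treating 0 as "no impact, do not notify" is an assumption made for the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class LeastImpactSketch {
    // Assumed sentinel meaning "this observer has nothing to offer".
    static final int USER_IMPACT_NONE = 0;

    /** Returns the observer with the smallest non-zero impact, or null if none. */
    static String pick(Map<String, Integer> impactByObserver) {
        String best = null;
        int bestImpact = Integer.MAX_VALUE;
        for (Map.Entry<String, Integer> entry : impactByObserver.entrySet()) {
            int impact = entry.getValue();
            if (impact != USER_IMPACT_NONE && impact < bestImpact) {
                best = entry.getKey();
                bestImpact = impact;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> impacts = new LinkedHashMap<>();
        impacts.put("observerA", 0);   // opts out
        impacts.put("observerB", 5);
        impacts.put("observerC", 3);   // least disruptive, so it is chosen
        System.out.println(pick(impacts));
    }
}
```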
/** Increments the boot counter, and returns whether the device is bootlooping. */
public boolean incrementAndTest() {
// Read the mitigation count from metadata if necessary
readMitigationCountFromMetadataIfNecessary();
final long now = mSystemClock.uptimeMillis();
if (now - getStart() < 0) {
Slog.e(TAG, "Window was less than zero. Resetting start to current time.");
setStart(now);
setMitigationStart(now);
}
// No failure for a long time: reset the mitigation count
if (now - getMitigationStart() > DEFAULT_DEESCALATION_WINDOW_MS) {
setMitigationCount(0);
setMitigationStart(now);
}
final long window = now - getStart();
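// The window expired before the threshold was reached: start a new window and restart the system_server boot counter
if (window >= mTriggerWindow) {
setCount(1);
setStart(now);
return false;
} else {
// Still inside the window: increment the counter and return true once it reaches mBootTriggerCount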
int count = getCount() + 1;
setCount(count);
EventLogTags.writeRescueNote(Process.ROOT_UID, count, window);
return count >= mBootTriggerCount;
}
}
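The windowed counting in incrementAndTest can be reproduced without system properties. The sketch below keeps the window start and count in fields and takes the current time as a parameter so it can be exercised deterministically. It deliberately leaves out the mitigation count de-escalation, and the class name is hypothetical.

```java
final class BootWindowCounterSketch {
    private final long triggerWindowMs;
    private final int triggerCount;
    private long windowStartMs = 0;
    private int count = 0;

    BootWindowCounterSketch(long triggerWindowMs, int triggerCount) {
        this.triggerWindowMs = triggerWindowMs;
        this.triggerCount = triggerCount;
    }

    /** Records one boot at nowMs and reports whether this looks like a boot loop. */
    boolean incrementAndTest(long nowMs) {
        if (nowMs - windowStartMs < 0) {
            windowStartMs = nowMs; // clock went backwards: restart the window
        }
        long window = nowMs - windowStartMs;
        if (window >= triggerWindowMs) {
            // Previous window expired: this boot is the first of a new window.
            count = 1;
            windowStartMs = nowMs;
            return false;
        }
        count++;
        return count >= triggerCount;
    }

    public static void main(String[] args) {
        // 10-minute window, 5 boots: the values quoted for Android 12 above.
        BootWindowCounterSketch counter = new BootWindowCounterSketch(600_000L, 5);
        for (long t = 1_000; t <= 5_000; t += 1_000) {
            System.out.println(t + " ms -> " + counter.incrementAndTest(t));
        }
    }
}
```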
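2. How app failures are detected
App failures generally come in two kinds, crashes and ANRs. Both are abnormal from the system's point of view, and both need to be recorded.
Start with crashes. First we need to know how Android gets hold of the exception.
In Java, when an exception is thrown and no handler deals with it, the program is terminated. The JVM has a predefined hierarchy of exception handling. The first layer is the try/catch block, for example: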
try {
crashyCode();
} catch (Exception e) {
...
}
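If the first catch block cannot handle the exception, it propagates to the caller of the method. If no catch block handles it, the exception is handed to the current thread's UncaughtExceptionHandler.
setUncaughtExceptionHandler can be called on a thread, in which case exceptions not caught by any catch block arrive there first. If the thread has no handler of its own, its ThreadGroup handles the exception instead. If the ThreadGroup does not handle it either, the exception finally goes to the default uncaught exception handler; when none is set, the stack trace is printed and the thread dies. You can override this behavior too:
Thread.setDefaultUncaughtExceptionHandler(/* custom UncaughtExceptionHandler implementation */);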
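The per-thread layer of that handler chain can be seen in a few lines of plain Java. This is a standalone sketch, unrelated to the Android sources: a worker thread throws, nothing catches the exception, and the handler installed on that thread receives it. The class and thread names are made up for the example.

```java
import java.util.concurrent.atomic.AtomicReference;

final class UncaughtHandlerSketch {
    /** Starts a thread that throws and returns what its handler observed. */
    static String runAndCapture() {
        AtomicReference<String> captured = new AtomicReference<>();
        Thread worker = new Thread(() -> {
            throw new IllegalStateException("boom"); // never caught by a catch block
        }, "crashy-worker");
        // Per-thread handler: consulted before the ThreadGroup and the default handler.
        worker.setUncaughtExceptionHandler((thread, error) ->
                captured.set(thread.getName() + ": " + error.getMessage()));
        worker.start();
        try {
            worker.join(); // the handler has finished once the thread has terminated
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return captured.get();
    }

    public static void main(String[] args) {
        System.out.println(runAndCapture());
    }
}
```

Replacing `worker.setUncaughtExceptionHandler` with `Thread.setDefaultUncaughtExceptionHandler` would route the same exception through the process-wide default handler instead.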
Android's crash handling is built on the same mechanism. It starts in RuntimeInit:
// Install the UncaughtExceptionHandler; uncaughtException is called when an uncaught exception occurs
Thread.setDefaultUncaughtExceptionHandler(new KillApplicationHandler(loggingHandler));
private static class KillApplicationHandler implements Thread.UncaughtExceptionHandler {
@Override
public void uncaughtException(Thread t, Throwable e) {
try {
// Hand the crash over to AMS
ActivityManager.getService().handleApplicationCrash(
mApplicationObject, new ApplicationErrorReport.ParcelableCrashInfo(e));
} catch (Throwable t2) {
...
} finally {
// Try everything to make sure this process goes away.
Process.killProcess(Process.myPid());
System.exit(10);
}
}
}
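This brings the exception into the Android framework code, where AMS handles it through handleApplicationCrash -> handleApplicationCrashInner. This is where the AppErrors class comes in: AMS calls its crashApplication for further processing:
AppErrors.crashApplication -> crashApplicationInner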
private void crashApplicationInner(ProcessRecord r, ApplicationErrorReport.CrashInfo crashInfo,
int callingPid, int callingUid) {
if (r != null) {
mPackageWatchdog.onPackageFailure(r.getPackageListWithVersionCode(),
PackageWatchdog.FAILURE_REASON_APP_CRASH);
}
...
}
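This calls PackageWatchdog.onPackageFailure. I won't go through the logic in detail: it mainly records crash information in MonitoredPackage objects, one per package name, and uses the number and timing of the recorded failures to decide whether to trigger a rescue level.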
public void onPackageFailure(List<VersionedPackage> packages,
@FailureReasons int failureReason) {
mLongTaskHandler.post(() -> {
synchronized (mLock) {
if (mAllObservers.isEmpty()) {
return;
}
boolean requiresImmediateAction = (failureReason == FAILURE_REASON_NATIVE_CRASH
|| failureReason == FAILURE_REASON_EXPLICIT_HEALTH_CHECK);
if (requiresImmediateAction) {
handleFailureImmediately(packages, failureReason);
} else {
for (int pIndex = 0; pIndex < packages.size(); pIndex++) {
VersionedPackage versionedPackage = packages.get(pIndex);
// Observer that will receive failure for versionedPackage
PackageHealthObserver currentObserverToNotify = null;
int currentObserverImpact = Integer.MAX_VALUE;
MonitoredPackage currentMonitoredPackage = null;
// Find observer with least user impact
for (int oIndex = 0; oIndex < mAllObservers.size(); oIndex++) {
ObserverInternal observer = mAllObservers.valueAt(oIndex);
PackageHealthObserver registeredObserver = observer.registeredObserver;
// Check and record each package name separately
if (registeredObserver != null
&& observer.onPackageFailureLocked(
versionedPackage.getPackageName())) {
MonitoredPackage p = observer.getMonitoredPackage(
versionedPackage.getPackageName());
int mitigationCount = 1;
if (p != null) {
mitigationCount = p.getMitigationCountLocked() + 1;
}
int impact = registeredObserver.onHealthCheckFailed(
versionedPackage, failureReason, mitigationCount);
if (impact != PackageHealthObserverImpact.USER_IMPACT_NONE
&& impact < currentObserverImpact) {
currentObserverToNotify = registeredObserver;
currentObserverImpact = impact;
currentMonitoredPackage = p;
}
}
}
// Execute action with least user impact
if (currentObserverToNotify != null) {
int mitigationCount = 1;
if (currentMonitoredPackage != null) {
currentMonitoredPackage.noteMitigationCallLocked();
mitigationCount =
currentMonitoredPackage.getMitigationCountLocked();
}
currentObserverToNotify.execute(versionedPackage,
failureReason, mitigationCount);
}
}
}
}
});
}
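The per-package bookkeeping described before this method (one record per package name, a failure count, and a time window) can be sketched as follows. This is an illustration of the idea only, not how MonitoredPackage is actually implemented; the class name, the method name `onFailure`, and the choice to clear the history once the threshold is hit are all assumptions.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

final class PackageFailureTrackerSketch {
    private final long windowMs;
    private final int triggerCount;
    // One queue of failure timestamps per package name.
    private final Map<String, ArrayDeque<Long>> failures = new HashMap<>();

    PackageFailureTrackerSketch(long windowMs, int triggerCount) {
        this.windowMs = windowMs;
        this.triggerCount = triggerCount;
    }

    /** Records a failure and returns true when the package crossed the threshold. */
    boolean onFailure(String packageName, long nowMs) {
        ArrayDeque<Long> times =
                failures.computeIfAbsent(packageName, key -> new ArrayDeque<>());
        times.addLast(nowMs);
        // Drop failures that have fallen out of the window.
        while (!times.isEmpty() && nowMs - times.peekFirst() > windowMs) {
            times.removeFirst();
        }
        if (times.size() >= triggerCount) {
            times.clear(); // start counting afresh after a mitigation
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // 60-second window, 5 failures: the Android 12 defaults quoted earlier.
        PackageFailureTrackerSketch tracker = new PackageFailureTrackerSketch(60_000L, 5);
        for (long t = 0; t <= 4_000; t += 1_000) {
            System.out.println(t + " ms -> " + tracker.onFailure("com.example.app", t));
        }
    }
}
```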
ANRs are handled much like crashes, also in AppErrors. The entry point is handleShowAnrUi, so I won't repeat the details:
void handleShowAnrUi(Message msg) {
// Notify PackageWatchdog without the lock held
if (packageList != null) {
mPackageWatchdog.onPackageFailure(packageList,
PackageWatchdog.FAILURE_REASON_APP_NOT_RESPONDING);
}
}
To tell whether a problem is caused by RescueParty, search the log for the keyword RescueParty:
RescueParty: Attempting rescue level FACTORY_RESET
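When going through a large log file, a small helper can pull the level out of such lines. The sketch below matches only the message shown above; the timestamp and pid prefix in the sample line are made up for illustration, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RescueLogSketch {
    private static final Pattern RESCUE_LINE =
            Pattern.compile("RescueParty: Attempting rescue level (\\w+)");

    /** Returns the rescue level named in the log line, or null if there is none. */
    static String rescueLevelIn(String logLine) {
        Matcher matcher = RESCUE_LINE.matcher(logLine);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Illustrative line: only the part after the tag comes from the log above.
        String line = "01-01 00:00:10.000  1000  1000 W "
                + "RescueParty: Attempting rescue level FACTORY_RESET";
        System.out.println(rescueLevelIn(line));
    }
}
```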
https://source.android.google.cn/devices/tech/debug/rescue-party