售后项目报出用户手中机器低概率卡死重启的问题
问题分析
获取log分析是触发了SWT(Software Watchdog Timeout)
"android.fg" prio=5 tid=13 Native
| group="main" sCount=1 ucsCount=0 flags=1 obj=0x14f809e0 self=0xb4000073d4129f00
| sysTid=2396 nice=0 cgrp=foreground sched=0/0 handle=0x71e1d12cb0
native: #00 pc 00000000000a07c8 /apex/com.android.runtime/lib64/bionic/libc.so (__ioctl+8)
native: #01 pc 000000000005b304 /apex/com.android.runtime/lib64/bionic/libc.so (ioctl+156)
native: #02 pc 0000000000052ddc /system/lib64/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+292)
native: #03 pc 000000000005402c /system/lib64/libbinder.so (android::IPCThreadState::waitForResponse(android::Parcel*, int*)+116)
at android.os.BinderProxy.transactNative(Native method)
at android.os.BinderProxy.transact(BinderProxy.java:571)
at android.os.IVold$Stub$Proxy.monitor(IVold.java:1657)
at com.android.server.StorageManagerService.monitor(StorageManagerService.java:4671)
at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:277)
at android.os.Handler.handleCallback(Handler.java:938)
at android.os.Handler.dispatchMessage(Handler.java:99)
at android.os.Looper.loopOnce(Looper.java:201)
at android.os.Looper.loop(Looper.java:288)
at android.os.HandlerThread.run(HandlerThread.java:67)
at com.android.server.ServiceThread.run(ServiceThread.java:44)
android.fg长时间卡在StorageManagerService.monitor方法
而StorageManagerService.monitor方法就是调用底层vold.monitor去判断vold进程是不是卡住
frameworks/base/services/core/java/com/android/server/StorageManagerService.java
public void monitor() {
try {
mVold.monitor();
} catch (Exception e) {
Slog.wtf(TAG, e);
}
}
底层vold的monitor方法实现:
system/vold/VoldNativeService.cpp,可以看到是去尝试获得两个锁对象
binder::Status VoldNativeService::monitor() {
ENFORCE_SYSTEM_OR_ROOT;
// Simply acquire/release each lock for watchdog
{ ACQUIRE_LOCK; }
{ ACQUIRE_CRYPT_LOCK; }
return Ok();
}
#define ACQUIRE_LOCK \
std::lock_guard<std::mutex> lock(VolumeManager::Instance()->getLock()); \
ATRACE_CALL();
#define ACQUIRE_CRYPT_LOCK \
std::lock_guard<std::mutex> lock(VolumeManager::Instance()->getCryptLock()); \
ATRACE_CALL();
根据出问题时的vold堆栈证实此时vold就是卡在这个lock上
"Binder:637_2" sysTid=637
#00 pc 000000000004dc5c /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28) (BuildId: a0d34b9798d47ef26fc0fe8436db3b96)
#01 pc 00000000000524b8 /apex/com.android.runtime/lib64/bionic/libc.so (__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+148) (BuildId: a0d34b9798d47ef26fc0fe8436db3b96)
#02 pc 00000000000b61d4 /apex/com.android.runtime/lib64/bionic/libc.so (NonPI::MutexLockWithTimeout(pthread_mutex_internal_t*, bool, timespec const*)+224) (BuildId: a0d34b9798d47ef26fc0fe8436db3b96)
#03 pc 0000000000097040 /system/lib64/libc++.so (std::__1::mutex::lock()+12) (BuildId: 8d6dfa11381841c5fb0337829babf4cb)
#04 pc 000000000005da10 /system/bin/vold (android::vold::VoldNativeService::monitor()+76) (BuildId: 5c462da9ef3e6b5527921f299c7ea1db)
#05 pc 00000000000ede4c /system/bin/vold (android::os::BnVold::onTransact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+8756) (BuildId: 5c462da9ef3e6b5527921f299c7ea1db)
#06 pc 000000000004984c /system/lib64/libbinder.so (android::BBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+252) (BuildId: ffe164de737e9c91883e81d145084107)
#07 pc 0000000000053624 /system/lib64/libbinder.so (android::IPCThreadState::executeCommand(int)+1028) (BuildId: ffe164de737e9c91883e81d145084107)
#08 pc 0000000000053154 /system/lib64/libbinder.so (android::IPCThreadState::getAndExecuteCommand()+164) (BuildId: ffe164de737e9c91883e81d145084107)
#09 pc 00000000000539e4 /system/lib64/libbinder.so (android::IPCThreadState::joinThreadPool(bool)+72) (BuildId: ffe164de737e9c91883e81d145084107)
#10 pc 000000000002f9bc /system/bin/vold (main+2404) (BuildId: 5c462da9ef3e6b5527921f299c7ea1db)
#11 pc 0000000000049f10 /apex/com.android.runtime/lib64/bionic/libc.so (__libc_init+100) (BuildId: a0d34b9798d47ef26fc0fe8436db3b96)
也就是说vold某个线程一直在调用ACQUIRE_LOCK或者ACQUIRE_CRYPT_LOCK进行持锁,对照vold的堆栈分析发现大部分线程都是__futex_wait_ex等锁或者poll Idle状态,只有一个线程是在做mount的动作
"Binder:637_5" sysTid=807
#00 pc 00000000000a0524 /apex/com.android.runtime/lib64/bionic/libc.so (read+4) (BuildId: a0d34b9798d47ef26fc0fe8436db3b96)
#01 pc 00000000000ac0e4 /apex/com.android.runtime/lib64/bionic/libc.so (__sread+48) (BuildId: a0d34b9798d47ef26fc0fe8436db3b96)
#02 pc 00000000000abfcc /apex/com.android.runtime/lib64/bionic/libc.so (__srefill+264) (BuildId: a0d34b9798d47ef26fc0fe8436db3b96)
#03 pc 00000000000ae31c /apex/com.android.runtime/lib64/bionic/libc.so (fgets_unlocked+96) (BuildId: a0d34b9798d47ef26fc0fe8436db3b96)
#04 pc 00000000000ae274 /apex/com.android.runtime/lib64/bionic/libc.so (fgets+68) (BuildId: a0d34b9798d47ef26fc0fe8436db3b96)
#05 pc 00000000000537c0 /system/bin/vold (android::vold::ForkExecvp(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >*, char*)+828) (BuildId: 5c462da9ef3e6b5527921f299c7ea1db)
#06 pc 00000000000b1200 /system/bin/vold (android::vold::vfat::Check(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)+480) (BuildId: 5c462da9ef3e6b5527921f299c7ea1db)
#07 pc 00000000000b3c5c /system/bin/vold (android::vold::PublicVolume::doMount()+136) (BuildId: 5c462da9ef3e6b5527921f299c7ea1db)
#08 pc 00000000000b70bc /system/bin/vold (android::vold::VolumeBase::mount()+80) (BuildId: 5c462da9ef3e6b5527921f299c7ea1db)
#09 pc 000000000005e52c /system/bin/vold (android::vold::VoldNativeService::mount(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, int, android::sp<android::os::IVoldMountCallback> const&)+220) (BuildId: 5c462da9ef3e6b5527921f299c7ea1db)
#10 pc 00000000000ec4d4 /system/bin/vold (android::os::BnVold::onTransact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+2236) (BuildId: 5c462da9ef3e6b5527921f299c7ea1db)
#11 pc 000000000004984c /system/lib64/libbinder.so (android::BBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+252) (BuildId: ffe164de737e9c91883e81d145084107)
#12 pc 0000000000053624 /system/lib64/libbinder.so (android::IPCThreadState::executeCommand(int)+1028) (BuildId: ffe164de737e9c91883e81d145084107)
#13 pc 0000000000053154 /system/lib64/libbinder.so (android::IPCThreadState::getAndExecuteCommand()+164) (BuildId: ffe164de737e9c91883e81d145084107)
#14 pc 00000000000539e4 /system/lib64/libbinder.so (android::IPCThreadState::joinThreadPool(bool)+72) (BuildId: ffe164de737e9c91883e81d145084107)
#15 pc 000000000007fed0 /system/lib64/libbinder.so (android::PoolThread::threadLoop()+28) (BuildId: ffe164de737e9c91883e81d145084107)
#16 pc 0000000000013454 /system/lib64/libutils.so (android::Thread::_threadLoop(void*)+264) (BuildId: d206f83f5ab9c782beffd3b3c018aeb0)
#17 pc 0000000000012cc4 /system/lib64/libutils.so (thread_data_t::trampoline(thread_data_t const*)+408) (BuildId: d206f83f5ab9c782beffd3b3c018aeb0)
#18 pc 00000000000b508c /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+264) (BuildId: a0d34b9798d47ef26fc0fe8436db3b96)
#19 pc 0000000000052f9c /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+68) (BuildId: a0d34b9798d47ef26fc0fe8436db3b96)
结合堆栈分析调用路径发现这个monut方法中正好有持锁的动作:
system/vold/VoldNativeService.cpp
binder::Status VoldNativeService::mount(k) {
ACQUIRE_LOCK;
int res = vol->mount();
}
也就是说这个线程中mount动作一直没有结束,一直持有锁导致主线程中monitor方法无法正常调用。
总结下以上发现的逻辑:当我们在插入sdcard时,系统会自动帮我们进行mount工作,其中会进行持锁,vold的mount是在PublicVolume doMount方法中实现的,先会根据不同的sdcard类型调用不同的Check方法对sdcard卡是否损坏进行检测和修复
system/vold/model/PublicVolume.cpp
status_t PublicVolume::doMount() {
bool isVisible = getMountFlags() & MountFlags::kVisible;
readMetadata();
if (mFsType == "vfat" && vfat::IsSupported()) {
if (vfat::Check(mDevPath)) {
LOG(ERROR) << getId() << " failed filesystem check";
return -EIO;
}
} else if (mFsType == "exfat" && exfat::IsSupported()) {
if (exfat::Check(mDevPath)) {
LOG(ERROR) << getId() << " failed filesystem check";
return -EIO;
}
} else {
LOG(ERROR) << getId() << " unsupported filesystem " << mFsType;
return -EIO;
}
sdcard格式类型有很多,不同格式调用的方法不一样,上面堆栈可以看到,这里走的是vfat格式
system/vold/fs/Vfat.cpp
static const char* kFsckPath = "/system/bin/fsck_msdos";
status_t Check(const std::string& source) {
int pass = 1;
int rc = 0;
do {
std::vector<std::string> cmd;
cmd.push_back(kFsckPath);
cmd.push_back("-p");
cmd.push_back("-f");
cmd.push_back("-y");
cmd.push_back(source);
// Fat devices are currently always untrusted
rc = ForkExecvp(cmd, nullptr, sFsckUntrustedContext);
if (rc < 0) {
LOG(ERROR) << "Filesystem check failed due to logwrap error";
errno = EIO;
return -1;
}
switch (rc) {
case 0:
LOG(INFO) << "Filesystem check completed OK";
return 0;
case 2:
LOG(ERROR) << "Filesystem check failed (not a FAT filesystem)";
errno = ENODATA;
return -1;
case 4:
if (pass++ <= 3) {
LOG(WARNING) << "Filesystem modified - rechecking (pass " << pass << ")";
continue;
}
LOG(ERROR) << "Failing check after too many rechecks";
errno = EIO;
return -1;
case 8:
LOG(ERROR) << "Filesystem check failed (no filesystem)";
errno = ENODATA;
return -1;
default:
LOG(ERROR) << "Filesystem check failed (unknown exit code " << rc << ")";
errno = EIO;
return -1;
}
} while (0);
return 0;
}
这里检测其实就是使用fsck_msdos -p -f -y 命令对sdcard进行校验,通过fork另一进程去执行的,父进程一直会等待子进程的结果
system/vold/Utils.cpp
status_t ForkExecvp(const std::vector<std::string>& args, std::vector<std::string>* output,
security_context_t context) {
auto argv = ConvertToArgv(args);
android::base::unique_fd pipe_read, pipe_write;
if (!android::base::Pipe(&pipe_read, &pipe_write)) {
PLOG(ERROR) << "Pipe in ForkExecvp";
return -errno;
}
pid_t pid = fork();
if (pid == 0) {
if (context) {
if (setexeccon(context)) {
LOG(ERROR) << "Failed to setexeccon in ForkExecvp";
abort();
}
}
pipe_read.reset();
if (dup2(pipe_write.get(), STDOUT_FILENO) == -1) {
PLOG(ERROR) << "dup2 in ForkExecvp";
_exit(EXIT_FAILURE);
}
pipe_write.reset();
execvp(argv[0], const_cast<char**>(argv.data()));
PLOG(ERROR) << "exec in ForkExecvp";
_exit(EXIT_FAILURE);
}
if (pid == -1) {
PLOG(ERROR) << "fork in ForkExecvp";
return -errno;
}
pipe_write.reset();
auto st = ReadLinesFromFdAndLog(output, std::move(pipe_read));
if (st != 0) return st;
int status;
if (waitpid(pid, &status, 0) == -1) {
PLOG(ERROR) << "waitpid in ForkExecvp";
return -errno;
}
if (!WIFEXITED(status)) {
LOG(ERROR) << "Process did not exit normally, status: " << status;
return -ECHILD;
}
if (WEXITSTATUS(status)) {
LOG(ERROR) << "Process exited with code: " << WEXITSTATUS(status);
return WEXITSTATUS(status);
}
return OK;
}
这个问题在插入损坏sdcard时必现,现象跟着卡走,结合上面的分析认为这是sdacrd卡硬件异常导致的系统卡死问题,非软件问题。
客户也认可此结论,但希望这边提供某些方案看能不能避免此问题。
方案讨论
考虑的方案主要有两种:
方案1:现在问题时fsck在进行sdcard check时卡死,一般正常sdcard几秒能就能结束,考虑能否在fsck里加上超时机制,超时自动停止结束,这样即使sdcard无法使用,但系统并不会重启。
fsck_msdos源码位置在/external/fsck_msdos/下,本身没有超时机制,修改有一定风险,目前还需分析具体卡住的原因才能看能不能修改,周期长,不推荐给客户。
方案2:
上层看门狗不对底层vold进行监听
系统确实预留了一个WATCHDOG_ENABLE用来控制看门狗是否需要监听vold进程。
frameworks/base/services/core/java/com/android/server/StorageManagerService.java
/**
* We now talk to vold over Binder, and it has its own internal lock to
* serialize certain calls. All long-running operations have been migrated
* to be async with callbacks, so we want watchdog to fire if vold wedges.
*/
private static final boolean WATCHDOG_ENABLE = true; //WATCHDOG_ENABLE = false;
public StorageManagerService(Context context) {
...
// Add ourself to the Watchdog monitors if enabled.
if (WATCHDOG_ENABLE) {
Watchdog.getInstance().addMonitor(this);
}
...
}
因此,如果 WATCHDOG_ENABLE = false,将不会调用 monitor 方法来检测 vold 是否卡住并且不会触发 SWT。
但是StorageManagerService中的大部分API都是调用vold的。 如果vold卡住了,很多API都不能正常工作,如果进程调用相关的API,仍然存在进程卡死的风险。
2273 mVold.mount(vol.id, vol.mountFlags, vol.mountUserId, new IVoldMountCallback.Stub() { in mount()
2324 mVold.unmount(vol.id); in unmount()
2338 mVold.format(vol.id, "auto"); in format()
2355 mVold.benchmark(volId, new IVoldTaskListener.Stub() { in benchmark()
2395 mVold.partition(diskId, IVold.PARTITION_TYPE_PUBLIC, -1); in partitionPublic()
2409 mVold.partition(diskId, IVold.PARTITION_TYPE_PRIVATE, -1); in partitionPrivate()
2423 mVold.partition(diskId, IVold.PARTITION_TYPE_MIXED, ratio); in partitionMixed()
2506 mVold.forgetPartition(partGuid, fsUuid); in forgetPartition()
2522 mVold.fstrim(flags, new IVoldTaskListener.Stub() { in fstrim()
2570 mVold.runIdleMaint(new IVoldTaskListener.Stub() { in runIdleMaint()
2599 mVold.abortIdleMaint(new IVoldTaskListener.Stub() { in abortIdleMaint()
2774 mVold.moveStorage(from.id, to.id, new IVoldTaskListener.Stub() { in setPrimaryStorageUuid()
2900 return mVold.fdeComplete(); in getEncryptionState()
2921 mVold.fdeCheckPassword(password); in decryptStorage()
2924 mVold.fdeRestart(); in decryptStorage()
2955 mVold.fdeEnable(type, password, 0); in encryptStorage()
2989 mVold.fdeChangePassword(type, password); in changeEncryptionPassword()
3019 mVold.fdeVerifyPassword(password); in verifyEncryptionPassword()
3037 return mVold.fdeGetPasswordType(); in getPasswordType()
3060 mVold.fdeSetField(field, contents); in setField()
3084 return mVold.fdeGetField(field); in getField()
3101 return mVold.isConvertibleToFbe(); in isConvertibleToFBE()
3115 return mVold.supportsCheckpoint(); in supportsCheckpoint()
3134 mVold.startCheckpoint(numTries); in startCheckpoint()
3147 mVold.commitChanges(); in commitChanges()
3156 return mVold.needsCheckpoint(); in needsCheckpoint()
3169 mVold.abortChanges(message, retry); in abortChanges()
3178 return mVold.fdeGetPassword(); in getPassword()
3191 mVold.fdeClearPassword(); in clearPassword()
3204 mVold.createUserKey(userId, serialNumber, ephemeral); in createUserKey()
3215 mVold.destroyUserKey(userId); in destroyUserKey()
此方案会导致StorageManagerService中涉及到vold的调用的API还是存在卡住风险,无法正常完成工作,涉及70+API,逻辑复杂,其中很多API在开机过程中进行调用,风险太大,不建议进行修改。
最终评估方案风险极高,不推荐给客户。
方案三:
修改vold的逻辑,调用fsck时加上超时机制,
目前我们对比了Android T的代码,T上就是这么处理的
https://android-review.googlesource.com/c/platform/system/vold/+/1826726
关键代码:
status_t ForkExecvpTimeout(const std::vector<std::string>& args, std::chrono::seconds timeout) {
int status;
pid_t wait_timeout_pid = fork();
if (wait_timeout_pid == 0) {
pid_t pid = ForkExecvpAsync(args, context);
pid_t timer_pid = fork();
if (timer_pid == 0) {
sleep(timeout.count());
_exit(ETIMEDOUT);
}
pid_t finished = wait(&status);
if (finished == pid) {
kill(timer_pid, SIGTERM);
} else {
kill(pid, SIGTERM);
}
_exit(WEXITSTATUS(status));
}
if (waitpid(wait_timeout_pid, &status, 0) == -1) {
return -errno;
}
if (WEXITSTATUS(status)) {
return WEXITSTATUS(status);
}
return OK;
}
pid_t ForkExecvpAsync(const std::vector<std::string>& args, char* context) {
auto argv = ConvertToArgv(args);
pid_t pid = fork();
if (pid == 0) {
execvp(argv[0], const_cast<char**>(argv.data()));
}
return pid;
}
实现逻辑对比:
1.父进程fork一个子进程执行fsck命令,然后调用waitpid I/F等待子进程完成。
2.这里的问题是如果fsck检查进程卡住了,父进程会一直阻塞在waitpid处。
父进程process1 fork 一个子进程process2 (wait_timeout_pid) 用于超时检测,然后调用waitpid I/F 等待子进程完成。
然后:
process2 fork 一个新的 sub-subprocess3 (pid) 来执行 fsck 命令;
process2再次fork一个新的sub-subprocess4(timer_pid)进行休眠计时;
process2 调用wait I/F 阻塞,等待process3 或process4 中的任何一个完成。
等待过程 3 或 4 完成:
1.场景1(正常):进程3在进程4之前完成
2.场景2(超时45s):进程4在进程3之前完成
kill掉另一个进程,最后,process2结束,返回process1。
此方案可以规避SWT问题,但依然存在以下风险
1.由于vold进程卡在fsck命令上,导致调用vold时StorageManagerService中很多API无法正常使用。 45S后,超时触发,API才能恢复正常。
2.在sdcard挂载过程中会执行fsck检查,如果触发超时,vold会取消挂载,所以手机无法识别sdcard,此时上层应用也无法访问sdcard。