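Background:
During system development we often have to analyze hang problems, also called system freezes or frozen screens. The most direct symptom is that touches, navigation and other interactions get no response at all, as if the whole system were frozen.
The analysis flow for these freezes and hangs is essentially the same as for system ANR problems. In this post we use the built-in am hang command to make the system freeze on purpose, then work backward from the traces to uncover how am hang works.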
Simulating a hang:
The command is listed in the output of am -h:
adb shell am -h
Activity manager (activity) commands:
help
Print this help text.
hang [--allow-restart]
Hang the system.
--allow-restart: allow watchdog to perform normal system restart
So am provides a hang subcommand directly, which is enough to put the system into a hang.
Run the command:
test@test:~/disk2/demos/signal-kill$ adb shell am hang
Hanging the system...
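After the command runs it prints Hanging the system... Now operate the phone's screen, and take care not to interrupt the command while simulating the hang.
Tapping an app on the launcher leaves the screen stuck on the following view:
Pressing the home key does not return to the launcher either. After a few more attempts on the launcher it becomes clear that the system can no longer open any app and does not respond to any input.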
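Analyzing the hang with traces
Because tapping any app gets no response, the first thing to do is capture a trace of the system_server process and see what each of its threads is doing at that moment.
The commands for capturing the per-thread trace of a process were shared in an earlier post. Here kill -3 pid is used to get the trace (dump); signal 3 is SIGQUIT, which makes the runtime dump the stack of every thread.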
test@test:~/disk2/demos/signal-kill$ adb shell ps -A | grep system_server
system 548 317 17610920 358184 futex_wait_queue 0 S system_server
test@test:~/disk2/demos/signal-kill$ adb shell kill -3 548
test@test:~/disk2/demos/signal-kill$ adb pull /data/anr/
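The ps output and the two follow-up commands above can also be produced by a script. The following Java sketch pulls the PID column out of one ps -A line and assembles the kill and pull invocations as argument lists; it only builds the commands and does not run them. The names TraceCommands, parsePid and buildCommands are made up for illustration, and the column layout is assumed to match the output shown above (PID in the second column).

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of adb or AOSP. */
class TraceCommands {
    /** Returns the PID from one line of ps -A output, assuming PID is the second column. */
    static int parsePid(String psLine) {
        String[] cols = psLine.trim().split("\\s+");
        if (cols.length < 2) {
            throw new IllegalArgumentException("unexpected ps line: " + psLine);
        }
        return Integer.parseInt(cols[1]);
    }

    /** Builds the argument lists for the signal and pull steps; nothing is executed here. */
    static List<List<String>> buildCommands(int pid) {
        return Arrays.asList(
                Arrays.asList("adb", "shell", "kill", "-3", String.valueOf(pid)),
                Arrays.asList("adb", "pull", "/data/anr/"));
    }
}
```

On a host where adb is on the PATH, each list could be handed to java.lang.ProcessBuilder to actually run the step.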
The commands above export the system_server trace; the next step is to analyze it:
----- pid 548 at 2025-02-25 10:12:37.381511562+0800 -----
Cmd line: system_server
Build fingerprint: 'Android/sdk_phone64_x86_64/emu64x:VanillaIceCream/AP3A.241005.015.A2/eng.test.00000000.000000:eng/test-keys'
ABI: 'x86_64'
Build type: debug
Debug Store: 1,0,3691723::
suspend all histogram: Sum: 13.378ms 99% C.I. 1us-5283.200us Avg: 199.671us Max: 6688us
DALVIK THREADS (200):
"main" prio=5 tid=1 Blocked // in the Blocked state
| group="main" sCount=1 ucsCount=0 flags=1 obj=0x71b105d8 self=0x7f2f37a7d380
| sysTid=548 nice=-2 cgrp=foreground sched=0/0 handle=0x7f31cbd85d18
| state=S schedstat=( 18007443266 2213062009 56958 ) utm=1532 stm=268 core=3 HZ=100
| stack=0x7ffd0aaf7000-0x7ffd0aaf9000 stackSize=8188KB
| held mutexes=
at com.android.server.am.ActivityManagerService.broadcastIntentWithFeature(ActivityManagerService.java:16310)
- waiting to lock <0x0dc0cc58> (a com.android.server.am.ActivityManagerService) held by thread 183
at android.app.ContextImpl.sendBroadcastAsUser(ContextImpl.java:1507)
at com.android.server.alarm.AlarmManagerService$2.lambda$doAlarm$0(AlarmManagerService.java:1888)
at com.android.server.alarm.AlarmManagerService$2.$r8$lambda$FzGTY398wY7rsveJl5qWIzqCYlQ(AlarmManagerService.java:0)
at com.android.server.alarm.AlarmManagerService$2$$ExternalSyntheticLambda0.run(R8$$SyntheticClass:0)
at android.os.Handler.handleCallback(Handler.java:959)
at android.os.Handler.dispatchMessage(Handler.java:100)
at android.os.Looper.loopOnce(Looper.java:232)
at android.os.Looper.loop(Looper.java:317)
at com.android.server.SystemServer.run(SystemServer.java:973)
at com.android.server.SystemServer.main(SystemServer.java:657)
at java.lang.reflect.Method.invoke(Native method)
at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:580)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:864)
DumpLatencyMs: 21.074
"binder:548_2" prio=5 tid=11 Blocked // in the Blocked state
| group="main" sCount=1 ucsCount=0 flags=1 obj=0x14007880 self=0x7f2f37aa3760
| sysTid=565 nice=-2 cgrp=foreground sched=0/0 handle=0x7f2ef05508b0
| state=S schedstat=( 2841199991 1278973810 6623 ) utm=251 stm=32 core=3 HZ=100
| stack=0x7f2ef0459000-0x7f2ef045b000 stackSize=990KB
| held mutexes=
at com.android.server.am.ActivityManagerService.getMemoryTrimLevel(ActivityManagerService.java:10430)
- waiting to lock <0x0dc0cc58> (a com.android.server.am.ActivityManagerService) held by thread 183
at com.android.server.job.JobConcurrencyManager.refreshSystemStateLocked(JobConcurrencyManager.java:731)
at com.android.server.job.JobConcurrencyManager.updateCounterConfigLocked(JobConcurrencyManager.java:741)
at com.android.server.job.JobConcurrencyManager.prepareForAssignmentDeterminationLocked(JobConcurrencyManager.java:829)
at com.android.server.job.JobConcurrencyManager.assignJobsToContextsInternalLocked(JobConcurrencyManager.java:791)
at com.android.server.job.JobConcurrencyManager.assignJobsToContextsLocked(JobConcurrencyManager.java:775)
at com.android.server.job.JobSchedulerService.maybeRunPendingJobsLocked(JobSchedulerService.java:4206)
at com.android.server.job.JobSchedulerService.scheduleAsPackage(JobSchedulerService.java:1969)
- locked <0x02a7f83e> (a java.lang.Object)
at com.android.server.job.JobSchedulerService$JobSchedulerStub.enqueue(JobSchedulerService.java:5015)
at android.app.job.IJobScheduler$Stub.onTransact(IJobScheduler.java:238)
at android.os.Binder.execTransactInternal(Binder.java:1505)
at android.os.Binder.execTransact(Binder.java:1444)
DumpLatencyMs: 31.4682
"watchdog.monitor" prio=5 tid=13 Blocked // in the Blocked state
| group="main" sCount=1 ucsCount=0 flags=1 obj=0x1404fe38 self=0x7f2f37aa8ad0
| sysTid=568 nice=-2 cgrp=foreground sched=0/0 handle=0x7f2e8e4e88b0
| state=S schedstat=( 209480980 7452941 874 ) utm=14 stm=6 core=1 HZ=100
| stack=0x7f2e8e3e5000-0x7f2e8e3e7000 stackSize=1038KB
| held mutexes=
at com.android.server.am.ActivityManagerService.monitor(ActivityManagerService.java:18081)
- waiting to lock <0x0dc0cc58> (a com.android.server.am.ActivityManagerService) held by thread 183
at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:372)
at android.os.Handler.handleCallback(Handler.java:959)
at android.os.Handler.dispatchMessage(Handler.java:100)
at android.os.Looper.loopOnce(Looper.java:232)
at android.os.Looper.loop(Looper.java:317)
at android.os.HandlerThread.run(HandlerThread.java:85)
at com.android.server.ServiceThread.run(ServiceThread.java:46)
DumpLatencyMs: 32.2449
"android.fg" prio=5 tid=14 Blocked // in the Blocked state
| group="main" sCount=1 ucsCount=0 flags=1 obj=0x14050120 self=0x7f2f37aac270
| sysTid=569 nice=-2 cgrp=foreground sched=0/0 handle=0x7f2e8e3d88b0
| state=S schedstat=( 463072896 1485062249 1728 ) utm=44 stm=1 core=3 HZ=100
| stack=0x7f2e8e2d5000-0x7f2e8e2d7000 stackSize=1038KB
| held mutexes=
at com.android.server.am.ActivityManagerService.broadcastIntentWithFeature(ActivityManagerService.java:16310)
- waiting to lock <0x0dc0cc58> (a com.android.server.am.ActivityManagerService) held by thread 183
at android.app.ContextImpl.sendBroadcastAsUser(ContextImpl.java:1507)
at com.android.server.DropBoxManagerService$DropBoxManagerBroadcastHandler.prepareAndSendBroadcast(DropBoxManagerService.java:322)
at com.android.server.DropBoxManagerService$DropBoxManagerBroadcastHandler.handleMessage(DropBoxManagerService.java:296)
at android.os.Handler.dispatchMessage(Handler.java:107)
at android.os.Looper.loopOnce(Looper.java:232)
at android.os.Looper.loop(Looper.java:317)
at android.os.HandlerThread.run(HandlerThread.java:85)
at com.android.server.ServiceThread.run(ServiceThread.java:46)
The trace clearly shows that most threads in system_server are in the Blocked state, and all for the same reason:
- waiting to lock <0x0dc0cc58> (a com.android.server.am.ActivityManagerService) held by thread 183
They are all waiting for lock <0x0dc0cc58>, the AMS lock, which is held by thread 183. Here is the stack of thread 183:
"binder:548_A" prio=5 tid=183 Waiting
| group="main" sCount=1 ucsCount=0 flags=1 obj=0x150c2540 self=0x7f2f37bf66d0
| sysTid=1355 nice=-2 cgrp=foreground sched=0/0 handle=0x7f2dff1988b0
| state=S schedstat=( 1771060311 731910349 3599 ) utm=145 stm=31 core=2 HZ=100
| stack=0x7f2dff0a1000-0x7f2dff0a3000 stackSize=990KB
| held mutexes=
at java.lang.Object.wait(Native method)
- waiting on <0x00cd30be> (a com.android.server.am.ActivityManagerService$11)
at java.lang.Object.wait(Object.java:405)
at java.lang.Object.wait(Object.java:543)
at com.android.server.am.ActivityManagerService.hang(ActivityManagerService.java:8816)
- locked <0x0dc0cc58> (a com.android.server.am.ActivityManagerService)
- locked <0x00cd30be> (a com.android.server.am.ActivityManagerService$11)
at com.android.server.am.ActivityManagerShellCommand.runHang(ActivityManagerShellCommand.java:2293)
at com.android.server.am.ActivityManagerShellCommand.onCommand(ActivityManagerShellCommand.java:333)
at com.android.modules.utils.BasicShellCommandHandler.exec(BasicShellCommandHandler.java:97)
at android.os.ShellCommand.exec(ShellCommand.java:38)
at com.android.server.am.ActivityManagerService.onShellCommand(ActivityManagerService.java:10471)
at android.os.Binder.shellCommand(Binder.java:1230)
at android.os.Binder.onTransact(Binder.java:1043)
at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:5675)
at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:2812)
at android.os.Binder.execTransactInternal(Binder.java:1505)
at android.os.Binder.execTransact(Binder.java:1444)
DumpLatencyMs: 129.838
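The stack above makes it clear that while executing ActivityManagerService.hang, the thread holds the lock <0x0dc0cc58> (a com.android.server.am.ActivityManagerService) and stays in wait without ever releasing it.
Because thread 183 keeps holding <0x0dc0cc58> (a com.android.server.am.ActivityManagerService), every other thread that needs the big AMS lock cannot acquire it and ends up in the Blocked state.
The trace analysis has pinned the source location down to ActivityManagerService.hang(ActivityManagerService.java:8816). Next comes the source analysis of the hang method.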
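The manual step of finding which thread holds the contended lock can be sketched as a small parser. The Java code below scans a dump for "waiting to lock ... held by thread N" lines and counts the waiters per holder. LockHolderCounter and its methods are hypothetical names, not an Android API, and the line format is assumed to match the dump shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for reading a thread dump; not an Android API. */
class LockHolderCounter {
    // Matches lines such as:
    // - waiting to lock <0x0dc0cc58> (a com.android.server.am.ActivityManagerService) held by thread 183
    private static final Pattern WAITING = Pattern.compile(
            "waiting to lock <(0x[0-9a-fA-F]+)> \\(a ([^)]+)\\) held by thread (\\d+)");

    /** Counts, per holder tid, how many stack lines report waiting on a lock held by that thread. */
    static Map<Integer, Integer> countWaitersByHolder(String dump) {
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        Matcher m = WAITING.matcher(dump);
        while (m.find()) {
            counts.merge(Integer.parseInt(m.group(3)), 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the tid that blocks the most threads, or -1 if nothing is waiting on a lock. */
    static int topHolder(String dump) {
        int best = -1;
        int bestCount = 0;
        for (Map.Entry<Integer, Integer> e : countWaitersByHolder(dump).entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

In the excerpt shown above, all four waiting lines name thread 183, so that is the thread such a scan would point at.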
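Source analysis of hang
The adb shell am hang command works as follows: the shell launches the am process, which makes a cross-process call into the ActivityManagerService.onShellCommand method in system_server, as the stack shows: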
at com.android.server.am.ActivityManagerShellCommand.runHang(ActivityManagerShellCommand.java:2293)
at com.android.server.am.ActivityManagerShellCommand.onCommand(ActivityManagerShellCommand.java:333)
at com.android.modules.utils.BasicShellCommandHandler.exec(BasicShellCommandHandler.java:97)
at android.os.ShellCommand.exec(ShellCommand.java:38)
at com.android.server.am.ActivityManagerService.onShellCommand(ActivityManagerService.java:10471)
at android.os.Binder.shellCommand(Binder.java:1230)
at android.os.Binder.onTransact(Binder.java:1043)
The most important part is the ActivityManagerService.hang method itself:
@Override
public void hang(final IBinder who, boolean allowRestart) {
    // part omitted
    // Build a DeathRecipient whose main purpose is to watch for the death of the binder passed in as who.
    // who is a binder object owned by the process running the am command, so when that process
    // dies or is interrupted, binderDied below is called back.
    final IBinder.DeathRecipient death = new DeathRecipient() {
        @Override
        public void binderDied() { // called back when the am hang command is interrupted
            synchronized (this) {
                notifyAll(); // wakes up the death.wait() call below
            }
        }
    };
    try {
        who.linkToDeath(death, 0); // deliver who's death notification to death
    } catch (RemoteException e) {
        Slog.w(TAG, "hang: given caller IBinder is already dead.");
        return;
    }
    synchronized (this) { // takes the big AMS lock
        synchronized (death) { // takes the lock on death
            // Keep checking whether the who binder is still alive: as long as am hang
            // has not been interrupted, keep waiting.
            while (who.isBinderAlive()) {
                try {
                    death.wait(); // waits until the binder death callback above fires
                } catch (InterruptedException e) {
                }
            }
        }
    }
}
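To summarize, ActivityManagerService.hang does the following:
1. The am hang command-line client passes in a Binder object as an argument.
2. hang calls linkToDeath on that Binder object, that is, it listens for the Binder's death, which lets it detect when the am hang command is interrupted.
3. While holding the big AMS lock it waits in a loop; as long as the Binder of the am hang process is alive, it keeps holding the lock and waiting.
4. Once the am hang command is interrupted, the death callback for that Binder fires and wakes the waiter, which stops waiting and releases the big AMS lock.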
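The four points above can be reproduced with ordinary Java threads. The sketch below is a plain-Java model, not AOSP code: bigLock stands in for the AMS lock, death for the DeathRecipient monitor, and the clientAlive flag for who.isBinderAlive(). All names (HangDemo, clientDied, useBigLock, demo) are hypothetical. It shows that Object.wait releases only the monitor it is called on, so the outer lock stays held.

```java
import java.util.concurrent.CountDownLatch;

/** Plain-Java model of the pattern in hang(); no Android or Binder classes are involved. */
class HangDemo {
    private final Object bigLock = new Object();  // stands in for the AMS lock
    private final Object death = new Object();    // stands in for the DeathRecipient monitor
    private boolean clientAlive = true;           // stands in for who.isBinderAlive(); guarded by death
    private final CountDownLatch holding = new CountDownLatch(1);

    /** Same shape as hang(): take the big lock, then wait on death without releasing it. */
    void hang() {
        synchronized (bigLock) {
            synchronized (death) {
                holding.countDown();              // signal that both monitors are now held
                while (clientAlive) {
                    try {
                        death.wait();             // releases death's monitor only; bigLock stays held
                    } catch (InterruptedException e) {
                        // ignored, as in the original loop
                    }
                }
            }
        }
    }

    /** Same role as binderDied(): wake the waiter so it can leave both synchronized blocks. */
    void clientDied() {
        synchronized (death) {
            clientAlive = false;
            death.notifyAll();
        }
    }

    /** Any other caller that needs the big lock, like the Blocked threads in the trace. */
    void useBigLock() {
        synchronized (bigLock) {
            // nothing to do; acquiring the lock is the point
        }
    }

    /** Returns {holder state while hung, victim state while hung, victim state after release}. */
    static Thread.State[] demo() {
        HangDemo d = new HangDemo();
        Thread holder = new Thread(d::hang, "holder");
        Thread victim = new Thread(d::useBigLock, "victim");
        try {
            holder.start();
            d.holding.await();                    // holder owns bigLock from here on
            victim.start();
            Thread.State holderHung;
            Thread.State victimHung;
            do {                                  // poll until both threads have settled
                Thread.sleep(1);
                holderHung = holder.getState();
                victimHung = victim.getState();
            } while (holderHung != Thread.State.WAITING || victimHung != Thread.State.BLOCKED);
            d.clientDied();                       // plays the part of interrupting am hang
            holder.join();
            victim.join();
            return new Thread.State[] {holderHung, victimHung, victim.getState()};
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

HangDemo.demo() is expected to return WAITING, BLOCKED, TERMINATED, which lines up with thread 183 showing Waiting in the dump while the other threads show Blocked, and with everything recovering once the client goes away.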
The flow is summarized in the following diagram: