android restart cases

Introduction

Android 重启或者说是system server重启,常见的case有以下几种:

  • system server process crash
  • system server watchdog timeout
  • critical process crash

不论system server重启的原因是什么,system server重启后,logcat里,可以看到类似下面的打印。可以查找“system server.*terminated”关键字,来确认system server是否重启了。

01-01 15:48:23.955 E/Zygote  ( 4211): Exit zygote because system server (4525) has terminated

system server process crash

目前随着project treble 的推进,frameworks基本上都是原生的AOSP代码,system server自身出现逻辑错误crash的情况越来越少见了。

system server进程自身crash,logcat里可以看到关键字:FATAL EXCEPTION IN SYSTEM PROCESS,之后是相应的callstack。比如下面这个例子,是TIF里的空指针:

01-01 12:45:11.950 E/AndroidRuntime( 4429): *** FATAL EXCEPTION IN SYSTEM PROCESS: main
01-01 12:45:11.950 E/AndroidRuntime( 4429): java.lang.NullPointerException: Attempt to read from field 'android.media.tv.TvInputInfo com.android.server.tv.TvInputManagerService$TvInputState.info' on a null object reference
01-01 12:45:11.950 E/AndroidRuntime( 4429):     at com.android.server.tv.TvInputManagerService$TvInputState.-get0(TvInputManagerService.java)
01-01 12:45:11.950 E/AndroidRuntime( 4429):     at com.android.server.tv.TvInputManagerService.setStateLocked(TvInputManagerService.java:866)
01-01 12:45:11.950 E/AndroidRuntime( 4429):     at com.android.server.tv.TvInputManagerService.-wrap19(TvInputManagerService.java)
01-01 12:45:11.950 E/AndroidRuntime( 4429):     at com.android.server.tv.TvInputManagerService$HardwareListener.onStateChanged(TvInputManagerService.java:2757)
01-01 12:45:11.950 E/AndroidRuntime( 4429):     at com.android.server.tv.TvInputHardwareManager$ListenerHandler.handleMessage(TvInputHardwareManager.java:1152)  
01-01 12:45:11.950 E/AndroidRuntime( 4429):     at android.os.Handler.dispatchMessage(Handler.java:102)
01-01 12:45:11.950 E/AndroidRuntime( 4429):     at android.os.Looper.loop(Looper.java:154)
01-01 12:45:11.950 E/AndroidRuntime( 4429):     at com.android.server.SystemServer.run(SystemServer.java:355)
01-01 12:45:11.950 E/AndroidRuntime( 4429):     at com.android.server.SystemServer.main(SystemServer.java:222)
01-01 12:45:11.950 E/AndroidRuntime( 4429):     at java.lang.reflect.Method.invoke(Native Method)
01-01 12:45:11.950 E/AndroidRuntime( 4429):     at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:889)
01-01 12:45:11.950 E/AndroidRuntime( 4429):     at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:779)

可以根据logcat里的callstack找到出问题的函数调用,进而分析原因。

system server watchdog timeout

Android watchdog 用来实时监测系统服务是否正常运行,实时监测SystemServer进程,防止系统死锁。Watchdog本身继承Thread,是一个线程类,监测任务运行在独立的线程中。需要加入Watchdog监控的服务必须实现Monitor接口。默认的timeout是60s,分成两个waited_half。如果两个waited_half,monitor还没有返回。则认为系统hang住了,watchdog会dump 信息,kill system server。

Watchdog timeout的case大概有以下几种:

  • system server自身死锁
  • 其它进程block住system server
  • system server 进程得不到调度

Watchdog timout通常都伴随着ANR,分析时除了需要logcat之外,还需要trace文件。

Other process block system server

System server自身存在死锁,目前基本上没有。其它进程block住system serve,下面的这个例子,logcat里可以看到block的thread 和 callstack

01-01 15:48:23.144 W/Watchdog( 4525): *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.am.ActivityManagerService on foreground thread (android.fg), Blocked in handler on ui thread (android.ui), Blocked in handler on display thread (android.display), Blocked in handler on ActivityManager (ActivityManager)
01-01 15:48:23.144 D/WindowManager( 4525): interceptKeyTq keycode=82 interactive=true keyguardActive=false policyFlags=2b000000
01-01 15:48:23.144 W/Watchdog( 4525): foreground thread stack trace:
01-01 15:48:23.145 W/Watchdog( 4525):     at com.android.server.am.ActivityManagerService.monitor(ActivityManagerService.java:21685)
01-01 15:48:23.146 W/Watchdog( 4525):     at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:182)
01-01 15:48:23.146 W/Watchdog( 4525):     at android.os.Handler.handleCallback(Handler.java:751)
01-01 15:48:23.146 W/Watchdog( 4525):     at android.os.Handler.dispatchMessage(Handler.java:95)
01-01 15:48:23.146 W/Watchdog( 4525):     at android.os.Looper.loop(Looper.java:154)
01-01 15:48:23.146 W/Watchdog( 4525):     at android.os.HandlerThread.run(HandlerThread.java:61)
01-01 15:48:23.146 W/Watchdog( 4525):     at com.android.server.ServiceThread.run(ServiceThread.java:46)
01-01 15:48:23.146 W/Watchdog( 4525): ui thread stack trace:
01-01 15:48:23.147 W/Watchdog( 4525):     at com.android.server.am.ActivityManagerService.dispatchUidsChanged(ActivityManagerService.java:4227)
01-01 15:48:23.147 W/Watchdog( 4525):     at com.android.server.am.ActivityManagerService.-wrap7(ActivityManagerService.java)
01-01 15:48:23.148 W/Watchdog( 4525):     at com.android.server.am.ActivityManagerService$UiHandler.handleMessage(ActivityManagerService.java:1770)
01-01 15:48:23.148 W/Watchdog( 4525):     at android.os.Handler.dispatchMessage(Handler.java:102)
01-01 15:48:23.148 W/Watchdog( 4525):     at android.os.Looper.loop(Looper.java:154)
01-01 15:48:23.148 W/Watchdog( 4525):     at android.os.HandlerThread.run(HandlerThread.java:61)
01-01 15:48:23.148 W/Watchdog( 4525):     at com.android.server.ServiceThread.run(ServiceThread.java:46)

然后,在Traces文件里,具体查看System server的call stack信息,可以看到android.fg这个thread在等0x0c60a278这个锁,而这个锁由thread 98 lock住。

"android.fg" prio=5 tid=16 Blocked
  | group="main" sCount=1 dsCount=0 obj=0x12cf06a0 self=0x724e3fa000
  | sysTid=4544 nice=0 cgrp=default sched=0/0 handle=0x7241bff450
  | state=S schedstat=( 522077430 728705228 3951 ) utm=34 stm=18 core=1 HZ=100
  | stack=0x7241afd000-0x7241aff000 stackSize=1037KB
  | held mutexes=   
  at com.android.server.am.ActivityManagerService.monitor(ActivityManagerService.java:21685) 
  - waiting to lock <0x0c60a278> (a com.android.server.am.ActivityManagerService) held by thread 98
  at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:182)
  at android.os.Handler.handleCallback(Handler.java:751)
  at android.os.Handler.dispatchMessage(Handler.java:95)
  at android.os.Looper.loop(Looper.java:154)
  at android.os.HandlerThread.run(HandlerThread.java:61)
  at com.android.server.ServiceThread.run(ServiceThread.java:46)

由下面的callstack看到,thread 98 持有0x0c60a278这个锁,同时在等thread 84 持有的锁0x0318fd3。

"Binder:4525_20" prio=5 tid=98 Blocked
  | group="main" sCount=1 dsCount=0 obj=0x1341c5e0 self=0x723cb50000
  | sysTid=26192 nice=-2 cgrp=default sched=0/0 handle=0x723a435450
  | state=S schedstat=( 56631669726 22599315657 94187 ) utm=4711 stm=952 core=1 HZ=100
  | stack=0x723a33b000-0x723a33d000 stackSize=1005KB
  | held mutexes=
  at com.android.server.wm.WindowManagerService.notifyAppRelaunchesCleared(WindowManagerService.java:10501)
  - waiting to lock <0x0318fd3a> (a java.util.HashMap) held by thread 84
  at com.android.server.am.ActivityStack.cleanUpActivityLocked(ActivityStack.java:3816)
  at com.android.server.am.ActivityStack.removeHistoryRecordsForAppLocked(ActivityStack.java:4246)
  at com.android.server.am.ActivityStack.handleAppDiedLocked(ActivityStack.java:4988)
  at com.android.server.am.ActivityStackSupervisor.handleAppDiedLocked(ActivityStackSupervisor.java:1740)
  at com.android.server.am.ActivityManagerService.handleAppDiedLocked(ActivityManagerService.java:5101)
  at com.android.server.am.ActivityManagerService.appDiedLocked(ActivityManagerService.java:5250)
  at com.android.server.am.ActivityManagerService$AppDeathRecipient.binderDied(ActivityManagerService.java:1485)
  - locked <0x0c60a278> (a com.android.server.am.ActivityManagerService)
  at android.os.BinderProxy.sendDeathNotice(Binder.java:688)

下面的call stack看到 thread 84持有锁0x0318fd3,通过binder与surfaceflinger通过,调用的是surfacecreate接口。

"Binder:4525_12" prio=5 tid=84 Native
  | group="main" sCount=1 dsCount=0 obj=0x135f9550 self=0x7241da4800
  | sysTid=26177 nice=-10 cgrp=default sched=0/0 handle=0x723b20b450
  | state=S schedstat=( 58401616494 22320070685 93186 ) utm=4880 stm=960 core=1 HZ=100
  | stack=0x723b111000-0x723b113000 stackSize=1005KB
  | held mutexes=
  kernel: __switch_to+0x74/0x8c
  kernel: binder_thread_read+0xa2c/0xfe8
  kernel: binder_ioctl_write_read+0x1b4/0x298
  kernel: binder_ioctl+0x460/0x654
  kernel: do_vfs_ioctl+0x35c/0x588
  kernel: SyS_ioctl+0x8c/0xa4
  kernel: cpu_switch_to+0x48/0x4c
  native: #00 pc 000000000006a438  /system/lib64/libc.so (__ioctl+4)
  native: #01 pc 000000000001f8c8  /system/lib64/libc.so (ioctl+144)
  native: #02 pc 0000000000055534  /system/lib64/libbinder.so (_ZN7android14IPCThreadState14talkWithDriverEb+260)
  native: #03 pc 0000000000056318  /system/lib64/libbinder.so (_ZN7android14IPCThreadState15waitForResponseEPNS_6ParcelEPi+352)
  native: #04 pc 000000000004b1e0  /system/lib64/libbinder.so (_ZN7android8BpBinder8transactEjRKNS_6ParcelEPS1_j+72)
  native: #05 pc 000000000007b078  /system/lib64/libgui.so (???)
  native: #06 pc 00000000000887f4  /system/lib64/libgui.so (_ZN7android21SurfaceComposerClient13createSurfaceERKNS_7String8Ejjij+80)
  native: #07 pc 00000000000e52f8  /system/lib64/libandroid_runtime.so (???)
  native: #08 pc 0000000000c45e84  /system/framework/arm64/boot-framework.oat (Java_android_view_SurfaceControl_nativeCreate__Landroid_view_SurfaceSession_2Ljava_lang_String_2IIII+208)
  at android.view.SurfaceControl.nativeCreate(Native method)
  at android.view.SurfaceControl.<init>(SurfaceControl.java:302)
  at com.android.server.wm.WindowSurfaceController$SurfaceControlWithBackground.<init>(WindowSurfaceController.java:794)
  at com.android.server.wm.WindowSurfaceController.<init>(WindowSurfaceController.java:95)
  at com.android.server.wm.WindowStateAnimator.createSurfaceLocked(WindowStateAnimator.java:749)
  at com.android.server.wm.WindowManagerService.createSurfaceControl(WindowManagerService.java:3136)
  at com.android.server.wm.WindowManagerService.relayoutWindow(WindowManagerService.java:2944)
  - locked <0x0318fd3a> (a java.util.HashMap)
  at com.android.server.wm.Session.relayout(Session.java:215)
  at android.view.IWindowSession$Stub.onTransact(IWindowSession.java:286)
  at com.android.server.wm.Session.onTransact(Session.java:136)
  at android.os.Binder.execTransact(Binder.java:565)

到这里,可以分析出system server通过binder与surfaceflinger通信,而surfaceflinger进程没有及时返回,进而block住了system server。触发了watcdog。

Trace文件里,同样有surfaceflinger进程的callstack,我们可以分析下surfaceflinger进程在做什么。可以看到surfaceflinger 的binder 1线程在 pthread_cond_wait ,在等某个锁。有可能是surfacelinger自身死锁问题。

"Binder:4173_1" sysTid=4182
  #00 pc 000000000001bcec  /system/lib64/libc.so (syscall+28)
  #01 pc 0000000000067708  /system/lib64/libc.so (pthread_cond_wait+96)
  #02 pc 000000000006215c  /system/lib64/libsurfaceflinger.so
  #03 pc 0000000000023c48  /system/lib64/libsurfaceflinger.so
  #04 pc 000000000007aa08  /system/lib64/libgui.so (_ZN7android23BnSurfaceComposerClient10onTransactEjRKNS_6ParcelEPS1_j+208)
  #05 pc 0000000000049e24  /system/lib64/libbinder.so (_ZN7android7BBinder8transactEjRKNS_6ParcelEPS1_j+132)
  #06 pc 0000000000055ba8  /system/lib64/libbinder.so (_ZN7android14IPCThreadState14executeCommandEi+980)
  #07 pc 0000000000055718  /system/lib64/libbinder.so (_ZN7android14IPCThreadState20getAndExecuteCommandEv+156)
  #08 pc 0000000000055dc0  /system/lib64/libbinder.so (_ZN7android14IPCThreadState14joinThreadPoolEb+72)
  #09 pc 0000000000072dac  /system/lib64/libbinder.so
  #10 pc 000000000001242c  /system/lib64/libutils.so (_ZN7android6Thread11_threadLoopEPv+272)
  #11 pc 0000000000067fb0  /system/lib64/libc.so (_ZL15__pthread_startPv+196)
  #12 pc 000000000001d980  /system/lib64/libc.so (__start_thread+16)

除了surfaceflinger会block systemserver,其它进程如mediaserver等,也常block system server。

critical process crash

某些native 关键进程crash,也导致system 跟着重启。常见的是surfaceflinger。native 进程crash,会生成tombstone文件。原因多种多样,可以根据tombstone分析。

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值