此题是 上层发生SWT 喂狗超时 hang detect 主动call 重启
我司测试反馈跑monkey发生过重启现象
LOG 中有明显的db.fatal.00.KE DB文件, 解压DB, 在exp_main 里面可以看到:
Flavor Info: 'None'
Exception Log Time:[Tue Nov 19 18:12:25 CST 2019] [17.91493]
Exception Class: Kernel (KE)
PC is at [<ffffff80087db8b4>] hang_detect_thread+0x20c/0x338
LR is at [<ffffff80087db8ac>] hang_detect_thread+0x204/0x338
Current Executing Process:
[hang_detect, 196][kthreadd, 2]
Backtrace:
[<ffffff800808a798>] bug_handler.part.2+0x40/0x80
[<ffffff800808a7fc>] bug_handler+0x24/0x30
[<ffffff80080829e8>] brk_handler+0x88/0xd0
[<ffffff8008081704>] do_debug_exception+0x94/0x148
[<ffffff800808470c>] el1_dbg+0x18/0x8c
[<ffffff80080c4014>] kthread+0xec/0x100
[<ffffff8008084fa0>] ret_from_fork+0x10/0x30
[<ffffffffffffffff>] 0xffffffffffffffff
从exp_main log种可看出此题是hang detect 主动call bug, 即hang detect 检测到android 上层或底层已经卡住,主动重启手机。
查看SYS_KERNEL_LOG hang detect 信息:
11-19 18:11:58.461 <6>[12513.115747] (0)[196:hang_detect][Hang_Detect] hang_detect thread counts down -1:14, status 1.
hang_detect counts 为14 (tick 420s, system_server 喂狗间隔为30s)说明420s前上层可能卡主即有SWT发生,
查看mtklog 有看到mtklog\data_aee_exp\db.fatal.00.SWT 此题可以定性为上层卡主引起的重启。
那么接下来分析SWT db
SWT信息如下:
Flavor Info: 'None'
Exception Log Time:[Tue Nov 19 18:04:06 CST 2019] [12041.303324]
Exception Class: SWT
Exception Type: system_server_watchdog
Current Executing Process:
system_server
Trigger time:[2019-11-19 18:04:06.294283] pid:855
Backtrace:
Process: system_server
Subject: Blocked in monitor com.freeme.server.am.FreemeActivityManagerService on foreground thread (android.fg)
Build: DOOV//Q6401:8.1.0/O11019/1572577497:user/release-keys
打开SWT_JBT_TRACES文件 查看该时间点system_server的trace
查看main线程 被tid=123的线程blocked
"main" prio=5 tid=1 Blocked
| group="main" sCount=1 dsCount=0 flags=1 obj=0x73a44798 self=0x6f598c3a00
| sysTid=855 nice=-2 cgrp=default sched=0/0 handle=0x6fdee5f9a8
| state=S schedstat=( 67160603057 21558724792 225928 ) utm=4409 stm=2306 core=2 HZ=100
| stack=0x7ffca6c000-0x7ffca6e000 stackSize=8MB
| held mutexes=
at com.android.server.am.ActivityManagerService.broadcastIntent(ActivityManagerService.java:20350)
- waiting to lock <0x0e37dba6> (a com.freeme.server.am.FreemeActivityManagerService) held by thread 123
at android.app.ActivityManager.broadcastStickyIntent(ActivityManager.java:4186)
at android.app.ActivityManager.broadcastStickyIntent(ActivityManager.java:4176)
at com.android.server.BatteryService$12.run(BatteryService.java:822)
at android.os.Handler.handleCallback(Handler.java:790)
at android.os.Handler.dispatchMessage(Handler.java:99)
at android.os.Looper.loop(Looper.java:164)
at com.android.server.SystemServer.run(SystemServer.java:450)
at com.android.server.SystemServer.main(SystemServer.java:290)
at java.lang.reflect.Method.invoke(Native method)
at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:438)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:846)
继续看tid=123的线程是什么状态
"Binder:855_1A" prio=5 tid=123 Native
| group="main" sCount=1 dsCount=0 flags=1 obj=0x14a017a0 self=0x6f3f7d4c00
| sysTid=8128 nice=-10 cgrp=default sched=0/0 handle=0x6f350d24f0
| state=S schedstat=( 6537215621 1714416972 12079 ) utm=498 stm=154 core=5 HZ=100
| stack=0x6f34fd8000-0x6f34fda000 stackSize=1005KB
| held mutexes=
kernel: (couldn't read /proc/self/task/8128/stack)
native: #00 pc 000000000006a824 /system/lib64/libc.so (__ioctl+4)
native: #01 pc 00000000000240a0 /system/lib64/libc.so (ioctl+136)
native: #02 pc 0000000000054b50 /system/lib64/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+260)
native: #03 pc 00000000000558dc /system/lib64/libbinder.so (android::IPCThreadState::waitForResponse(android::Parcel*, int*)+360)
native: #04 pc 0000000000055614 /system/lib64/libbinder.so (android::IPCThreadState::transact(int, unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+224)
native: #05 pc 000000000004c318 /system/lib64/libbinder.so (android::BpBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+72)
native: #06 pc 000000000011f12c /system/lib64/libandroid_runtime.so (???)
native: #07 pc 000000000091b5b4 /system/framework/arm64/boot-framework.oat (Java_android_os_BinderProxy_transactNative__ILandroid_os_Parcel_2Landroid_os_Parcel_2I+196)
at android.os.BinderProxy.transactNative(Native method)
at android.os.BinderProxy.transact(Binder.java:764)
at android.app.IActivityController$Stub$Proxy.activityStarting(IActivityController.java:172)
at com.android.server.am.ActivityStarter.startActivity(ActivityStarter.java:460)
at com.android.server.am.ActivityStarter.startActivityLocked(ActivityStarter.java:289)
at com.android.server.am.ActivityStarter.startActivityMayWait(ActivityStarter.java:900)
- locked <0x0e37dba6> (a com.freeme.server.am.FreemeActivityManagerService)
at com.android.server.am.ActivityManagerService.startActivityAsUser(ActivityManagerService.java:4791)
at com.android.server.am.ActivityManagerService.startActivity(ActivityManagerService.java:4727)
at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:121)
at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:3062)
at android.os.Binder.execTransact(Binder.java:697)
该线程状态为native,并且callstack中含有
IPCThreadState::talkWithDriver
IPCThreadState::waitForResponse
说明binder的对端有卡住
查看SYS_BINDER_INFO 看下对端是什么进程
incoming transaction 23437272: ffffffc01a730a00 from 855:8128 to 21019:21031 code 1 flags 10 pri 0:110 r1 node 20769437 size 440:0 data ffffff8016000008,对端为21019:21031
可以看出对端为21019:21031
查看SYS_PROCESSES_AND_THREADS 进程信息
u:r:shell:s0 shell 21019 21031 5300 3708136 90744 wait_woken 7eff2fa890 S 29 -10 - 0 fg 64 0 com.android.commands.monkey Binder:21019_2
查看tid=21031的进程信息
"Binder:21019_2" prio=5 tid=9 Native
| group="main" sCount=1 dsCount=0 flags=1 obj=0x12e00950 self=0x7e7c625600
| sysTid=21031 nice=-10 cgrp=default sched=0/0 handle=0x7e736454f0
| state=S schedstat=( 42917455 13946540 59 ) utm=2 stm=1 core=5 HZ=100
| stack=0x7e7354b000-0x7e7354d000 stackSize=1005KB
| held mutexes=
kernel: __switch_to+0x8c/0xb0
kernel: wait_woken+0x4c/0xa8
kernel: n_tty_write+0x3a8/0x460
kernel: tty_write+0x188/0x2b0
kernel: __vfs_write+0x34/0xf8
kernel: vfs_write+0x94/0x1a8
kernel: SyS_write+0x5c/0xc8
kernel: el0_svc_naked+0x34/0x38
native: #00 pc 000000000006b890 /system/lib64/libc.so (write+8)
native: #01 pc 0000000000032024 /system/lib64/libjavacore.so (???)
native: #02 pc 00000000002371c4 /system/framework/arm64/boot-core-libart.oat (Java_libcore_io_Linux_writeBytes__Ljava_io_FileDescriptor_2Ljava_lang_Object_2II+196)
at libcore.io.Linux.writeBytes(Native method)
at libcore.io.Linux.write(Linux.java:288)
at libcore.io.BlockGuardOs.write(BlockGuardOs.java:345)
at libcore.io.IoBridge.write(IoBridge.java:553)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
- locked <0x0ffc3d7e> (a java.io.BufferedOutputStream)
at java.io.PrintStream.write(PrintStream.java:503)
- locked <0x04cd5032> (a java.io.PrintStream)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
- locked <0x0696c2df> (a java.io.OutputStreamWriter)
at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
at java.io.PrintStream.write(PrintStream.java:554)
- locked <0x04cd5032> (a java.io.PrintStream)
at java.io.PrintStream.print(PrintStream.java:698)
at java.io.PrintStream.println(PrintStream.java:835)
- locked <0x04cd5032> (a java.io.PrintStream)
at com.android.commands.monkey.Logger$1.println(Logger.java:27)
at com.android.commands.monkey.Monkey$ActivityController.activityStarting(Monkey.java:272)
at android.app.IActivityController$Stub.onTransact(IActivityController.java:62)
at android.os.Binder.execTransact(Binder.java:697)
如上对端为monkey,比较特殊,用互接触不到 可无需关注
若继续深追此题,要check 发生swt前30s的io流情况和cpu情况,为什么monkey写入数据慢
可通过MTK perlyzer 工具 解析SYS_BLOCKIO,查看发生SWT前30s的io情况和cpu情况