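The techniques below have been verified on the A40I/T3 core board (CoM-X40I) and main board (SBC-X40I) from 盈鹏飞嵌入式. Feedback and discussion are welcome! The CoM-X40I core board and SBC-X40I main board are shown in the figure below.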
1. Introduction
This post summarizes several debugging methods commonly used when the system behaves abnormally.
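2. Debuggerd
Use debuggerd together with echo t > /proc/sysrq-trigger to debug deadlocks and unexpected sleeps in both process (user) space and kernel space. debuggerd dumps the user-space stacks, and the sysrq 't' trigger makes the kernel write the state of every task to the kernel log.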
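As a rough sketch of how the two captures might be combined, the helper below takes a user-space backtrace with debuggerd -b and then triggers the sysrq task dump and reads it back from the kernel log. The function name capture_hang_state, the DEVICE_SH variable, and the use of adb shell and dmesg are assumptions made for illustration; writing to /proc/sysrq-trigger is assumed to require root on the device.

```shell
# Command prefix used to reach the device (assumption: adb is available).
# Set DEVICE_SH=echo to print the commands instead of running them.
DEVICE_SH="${DEVICE_SH:-adb shell}"

# capture_hang_state PID: dump user-space and kernel-space state for one process.
# (Hypothetical helper, not part of any Android tool.)
capture_hang_state() {
    # User-space backtrace of the process, as produced by debuggerd -b.
    $DEVICE_SH "debuggerd -b $1"
    # Ask the kernel to log the state of every task.
    $DEVICE_SH "echo t > /proc/sysrq-trigger"
    # Read the task dump back from the kernel log.
    $DEVICE_SH "dmesg"
}
```

A dry run such as DEVICE_SH=echo capture_hang_state 1364 only prints the three commands, which is a cheap way to check them before touching a hung device.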
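3. Kill command
kill -6 (SIGABRT) prints the core dump backtrace of the target process.
The data is saved to an incrementing file /data/tombstones/tombstone_0{0..9},
and a copy is also written to /data/anr/traces.txt.
The result is the same as the core dump printed by debuggerd.
kill -3 (SIGQUIT) prints the core dump backtrace of a process in the zygote process space.
The data is saved only to /data/anr/traces.txt,
and matches the traces.txt produced when the watchdog service in AMS detects an ANR.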
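Because the tombstone files rotate, it helps to pick out the newest one after sending the signal. The following is a minimal sketch, assuming it runs in a shell that can read the tombstone directory; latest_tombstone is a hypothetical helper name, and the default directory is the one named above.

```shell
# latest_tombstone [DIR]: print the most recently modified tombstone file.
# DIR defaults to /data/tombstones, the location named in the text.
latest_tombstone() {
    # ls -t sorts newest first; the error for "no such file" is silenced,
    # so the function prints nothing when no tombstone exists.
    ls -t "${1:-/data/tombstones}"/tombstone_* 2>/dev/null | head -n 1
}
```

A typical sequence would be kill -6 1364 (1364 being the example PID used later in this post) followed by latest_tombstone to find the file to read.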
4. Strace
usage: strace [-CdffhiqrtttTvVxxy] [-I n] [-e expr]...
              [-a column] [-o file] [-s strsize] [-P path]...
              -p pid... / [-D] [-E var=val]... [-u username] PROG [ARGS]
   or: strace -c[df] [-I n] [-e expr]... [-O overhead] [-S sortby]
              -p pid... / [-D] [-E var=val]... [-u username] PROG [ARGS]
-c -- count time, calls, and errors for each syscall and report summary
-C -- like -c but also print regular output
-w -- summarise syscall latency (default is system time)
-d -- enable debug output to stderr
-D -- run tracer process as a detached grandchild, not as parent
-f -- follow forks, -ff -- with output into separate files
-i -- print instruction pointer at time of syscall
-q -- suppress messages about attaching, detaching, etc.
-r -- print relative timestamp, -t -- absolute timestamp, -tt -- with usecs
-T -- print time spent in each syscall
-v -- verbose mode: print unabbreviated argv, stat, termios, etc. args
-x -- print non-ascii strings in hex, -xx -- print all strings in hex
-y -- print paths associated with file descriptor arguments
-h -- print help message, -V -- print version
-a column -- alignment COLUMN for printing syscall results (default 40)
-b execve -- detach on this syscall
-e expr -- a qualifying expression: option=[!]all or option=[!]val1[,val2]...
   options: trace, abbrev, verbose, raw, signal, read, write
-I interruptible --
   1: no signals are blocked
   2: fatal signals are blocked while decoding syscall (default)
   3: fatal signals are always blocked (default if '-o FILE PROG')
   4: fatal signals and SIGTSTP (^Z) are always blocked
      (useful to make 'strace -o FILE PROG' not stop on ^Z)
-o file -- send trace output to FILE instead of stderr
-O overhead -- set overhead for tracing syscalls to OVERHEAD usecs
-p pid -- trace process with process id PID, may be repeated
-s strsize -- limit length of print strings to STRSIZE chars (default 32)
-S sortby -- sort syscall counts by: time, calls, name, nothing (default time)
-u username -- run command as username handling setuid and/or setgid
-E var=val -- put var=val in the environment for command
-E var -- remove var from the environment for command
-P path -- trace accesses to path
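Common invocations (1364 is an example PID):
strace -Ff -p 1364 -T
strace -Ff -p 1364 -T -r
strace -Ff -p 1364 -T -t
or: strace -Ff -p 1364 -T -tt
strace -Ff -p 1364 -T -tt -o /data/strace.log
strace -Ff -p 1364 -c        (time spent in each system call)
strace -Ff -p 1364 -c -w     (wall-clock latency of each system call, including time spent waiting)
strace -Ff -p 1364 -y -tt -T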
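With -T, every completed call ends with its duration in angle brackets, so a log written with -o can be filtered for slow calls. The sketch below assumes that line format (a trailing <seconds.microseconds>); slow_syscalls is a hypothetical helper name, and unfinished or resumed fragments without a trailing duration are simply skipped.

```shell
# slow_syscalls THRESHOLD_SECONDS FILE
# Print the lines of an "strace -T" log whose call took at least
# THRESHOLD_SECONDS, judged by the trailing <duration> field.
slow_syscalls() {
    awk -v limit="$1" '
        match($0, /<[0-9]+\.[0-9]+>$/) {
            # Strip the surrounding angle brackets to get the duration.
            t = substr($0, RSTART + 1, RLENGTH - 2)
            if (t + 0 >= limit + 0) print
        }
    ' "$2"
}
```

For example, slow_syscalls 1 /data/strace.log lists every call that took a second or longer, which is usually where a stuck thread shows up first.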
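5. Application process ANR
(1) First use strace -fF -p {$PID} to identify which thread is in the ANR state.
(2) Use debuggerd -b {$PID} to check the thread's backtrace.
(3) Asynchronous wait: ANR thread A:A1 is waiting for thread A:B1 in the same process to finish a task; trace the thread state further with strace.
(4) Synchronous sleep: ANR thread A:A1 is waiting for a lock; check which thread holds the lock.
(5) System call blocking: ANR thread A:A1 is sleeping inside a system call; print the process's kernel-space stack and analyze why the system call sleeps.
(6) Inter-process communication wait: ANR thread A:A1 is sleeping during a binder transaction.
Use the binder thread state in the current process's proc entry to establish which thread is waiting on which.
For example, if thread A:A1 is waiting on thread B:B1, confirm the state of B:B1 with strace or debuggerd.
The analysis of B:B1 must in turn determine whether it is in an asynchronous wait, a synchronous sleep, a blocking system call, or an inter-process communication wait.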
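To get a first overview before attaching strace, the per-thread state can be read from procfs. This is a sketch under assumptions: it relies on the standard Linux /proc/PID/task/TID/status and wchan files (wchan may be missing or empty on some kernels), and thread_states and its optional PROC_ROOT argument are hypothetical names introduced here.

```shell
# thread_states PID [PROC_ROOT]: print "TID STATE WCHAN" for each thread.
# PROC_ROOT defaults to /proc; it is a parameter only so the function can be
# tried against a copied or fake tree.
thread_states() (
    root="${2:-/proc}"
    for task in "$root/$1"/task/*; do
        [ -d "$task" ] || continue
        tid="${task##*/}"
        # State letter from the "State:" line, e.g. S (sleeping) or D (uninterruptible).
        state=$(awk '/^State:/ { print $2; exit }' "$task/status" 2>/dev/null) || state=
        # Kernel function the thread is waiting in, when the kernel exposes it.
        wchan=$(cat "$task/wchan" 2>/dev/null) || wchan=
        printf '%s %s %s\n' "$tid" "${state:-?}" "${wchan:-unknown}"
    done
)
```

A thread that stays in state D, or in S with a wait channel inside a driver, is a candidate for case (5); a futex wait channel points more towards cases (3) and (4).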
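6. Monkey stability issues
Approach to troubleshooting monkey problems: when a monkey test stops, there are only two possibilities:
- (a) the system restarted abnormally;
- (b) the kernel's memory reclaim (OOM killer) killed monkey (memory leak).
(1) On Android it is usually case (a), which comes in several types:
1) A critical native system process aborts; its parent, init, kills all child processes and restarts the system.
2) The system server watchdog detects an ANR and kills system server; zygote detects that its system server child has exited and kills itself; init detects that its zygote child has exited, kills all child processes, and restarts.
3) A thread inside the system server process hits an exception and aborts, after which the flow is the same as in 2).
(2) To troubleshoot this kind of problem, first check the background log to see whether case (a) occurred, by searching for the keyword
AndroidRuntime START com.android.internal.os.ZygoteInit
If the keyword appears two or more times, the system restarted. Once case (a) is confirmed, determine which of types 1), 2), and 3) it is, as follows: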
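The restart check can be scripted against a saved log. The sketch below is illustrative only: zygote_start_count and framework_restarted are hypothetical helper names, and the log is assumed to be a plain-text logcat capture.

```shell
# zygote_start_count FILE: number of log lines marking a Zygote start.
zygote_start_count() {
    grep -c 'AndroidRuntime START com\.android\.internal\.os\.ZygoteInit' "$1"
}

# framework_restarted FILE: succeeds when Zygote started two or more times,
# which is the restart criterion described in the text.
framework_restarted() {
    [ "$(zygote_start_count "$1")" -ge 2 ]
}
```

For example: framework_restarted logcat.txt && echo "restart detected", where logcat.txt stands for whatever file the monkey run was logged to.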
For type 1), search for the following keywords, then search backwards from them to confirm whether a native system process such as surfaceflinger crashed:
ServiceManager( 1584): service 'display' died
ServiceManager( 1584): service 'usagestats' died
ServiceManager( 1584): service 'batterystats' died
For type 2), search for the keyword: WATCHDOG KILLING SYSTEM PROCESS
For type 3), search for the keyword: system_server
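One way to orient yourself is to list the marker lines with their line numbers, so that the order of events around each restart becomes visible. This is a sketch, not a classifier: restart_markers is a hypothetical helper, and it deliberately leaves out the bare system_server keyword on the assumption that it would match too many unrelated lines to be useful in a summary.

```shell
# restart_markers FILE: print "LINE:TEXT" for every log line carrying one of
# the restart-related keywords, in file order.
restart_markers() {
    grep -n \
        -e "ServiceManager.*service '.*' died" \
        -e 'WATCHDOG KILLING SYSTEM PROCESS' \
        -e 'AndroidRuntime START com\.android\.internal\.os\.ZygoteInit' \
        "$1"
}
```

Reading the output top to bottom shows what precedes each Zygote start: a watchdog line suggests type 2), while service-died lines with no watchdog line call for the backward search described for type 1).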
(3) Notes on analyzing the failure scene:
First identify the PIDs of the Zygote and SystemServer processes in the log file;
only then search the log near the first point of failure. Once the system has restarted several times,
it is easy to get lost in the log.
Zygote initialization keyword:
01-01 08:01:53.770 D/AndroidRuntime( 1590): >>>>>> AndroidRuntime START com.android.internal.os.ZygoteInit <<<<<<
Keywords for Zygote starting system server:
01-01 08:02:01.210 I/dalvikvm( 1590): System server process 2369 has been created
01-01 08:02:01.220 I/SystemServer( 2369): start SystemServer main :16606
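The PIDs can be pulled out of the log with sed. The sketch below assumes the exact line formats shown in the samples above (PID in parentheses after the tag, and the "System server process N has been created" message); zygote_pids and system_server_pids are hypothetical helper names.

```shell
# zygote_pids FILE: PID of each Zygote start, one per line, in log order.
zygote_pids() {
    sed -n 's/.*AndroidRuntime( *\([0-9][0-9]*\)).*AndroidRuntime START com\.android\.internal\.os\.ZygoteInit.*/\1/p' "$1"
}

# system_server_pids FILE: PID of each system server created by Zygote.
system_server_pids() {
    sed -n 's/.*System server process \([0-9][0-9]*\) has been created.*/\1/p' "$1"
}
```

When the system restarted several times, each function prints one PID per boot, so the n-th line of each output belongs to the n-th start and can be used to filter the log for that boot only.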