一. ANR概述
什么是ANR
ANR,是“Application Not Responding”的缩写,即“应用程序无响应”。在Android中,如果应用程序主线程在超时时间内对输入事件没有处理完毕,或者对特定操作没有执行完毕,就会出现ANR。对于输入事件没有处理完毕产生的ANR,Android会显示一个对话框,提示用户当前应用程序没有响应,用户可以选择继续等待或者关闭这个应用程序。
ANR类型
ANR一般有以下几种场景:
二.ANR原理
ANR触发流程
ANR是一套监控Android应用响应是否及时的机制,可以把发生ANR比作是引爆炸弹,那么整个流程包含三部分组成:
1.埋定时炸弹:中控系统(system_server进程)启动倒计时,在规定时间内如果目标(应用进程)没有干完所有的活,则中控系统会定向炸毁(杀进程)目标。
2.拆炸弹:在规定的时间内干完工地的所有活,并及时向中控系统报告完成,请求解除定时炸弹,则幸免于难。
3.引爆炸弹:中控系统立即封装现场,抓取快照,搜集目标执行慢的罪证(traces),便于后续的案件侦破(调试分析),最后是炸毁目标。
下面通过service来讲解整个ANR的触发流程,其他BroadcastReceiver、ContentProvider触发流程大同小异,可自行了解。
service超时机制
下面来看看埋炸弹与拆炸弹在整个服务启动(startService)过程所处的环节。
图解1:
1.客户端(App进程)向中控系统(system_server进程)发起启动服务的请求
2.中控系统派出一名空闲的通信员(binder_1线程)接收该请求,紧接着向组件管家(ActivityManager线程)发送消息,埋下定时炸弹
3.通讯员1号(binder_1)通知工地(service所在进程)的通信员准备开始干活
4.通讯员3号(binder_3)收到任务后转交给包工头(main主线程),加入包工头的任务队列(MessageQueue)
5.包工头经过一番努力干完活(完成service启动的生命周期)
6.包工头执行完成后,立刻向中控系统汇报工作已完成
7.中控系统的通讯员2号(binder_2)收到包工头的完工汇报后,立刻拆除炸弹。如果在炸弹倒计时结束前拆除炸弹则相安无事,否则会引发爆炸(触发ANR)
埋炸弹
在Service进程attach到system_server进程的过程中会调用realStartServiceLocked()方法来埋下炸弹.
2.3.1 AS.realStartServiceLocked
private final void realStartServiceLocked(ServiceRecord r, ProcessRecord app, boolean execInFg) throws RemoteException {
...
//发送delay消息(SERVICE_TIMEOUT_MSG)
bumpServiceExecutingLocked(r, execInFg, "create");
try {
...
//最终执行服务的onCreate()方法
app.thread.scheduleCreateService(r, r.serviceInfo,
mAm.compatibilityInfoForPackageLocked(r.serviceInfo.applicationInfo),
app.repProcState);
} catch (DeadObjectException e) {
mAm.appDiedLocked(app);
throw e;
} finally {
...
}
}
2.3.2 AS.bumpServiceExecutingLocked
private final void bumpServiceExecutingLocked(ServiceRecord r, boolean fg, String why) {
...
scheduleServiceTimeoutLocked(r.app);
}
void scheduleServiceTimeoutLocked(ProcessRecord proc) {
if (proc.executingServices.size() == 0 || proc.thread == null) {
return;
}
long now = SystemClock.uptimeMillis();
Message msg = mAm.mHandler.obtainMessage(
ActivityManagerService.SERVICE_TIMEOUT_MSG);
msg.obj = proc;
//当超时后仍没有remove该SERVICE_TIMEOUT_MSG消息,则执行service Timeout流程
mAm.mHandler.sendMessageAtTime(msg,
proc.execServicesFg ? (now+SERVICE_TIMEOUT) : (now+ SERVICE_BACKGROUND_TIMEOUT));
}
该方法的主要工作发送delay消息(SERVICE_TIMEOUT_MSG). 炸弹已埋下。
拆炸弹
那么什么时候会拆除这颗炸弹的引线呢? 经过Binder等层层调用进入目标进程的主线程handleCreateService()的过程.
2.4.1 AT.handleCreateService
private void handleCreateService(CreateServiceData data) {
...
java.lang.ClassLoader cl = packageInfo.getClassLoader();
Service service = (Service) cl.loadClass(data.info.name).newInstance();
...
try {
//创建ContextImpl对象
ContextImpl context = ContextImpl.createAppContext(this, packageInfo);
context.setOuterContext(service);
//创建Application对象
Application app = packageInfo.makeApplication(false, mInstrumentation);
service.attach(context, this, data.info.name, data.token, app,
ActivityManagerNative.getDefault());
//调用服务onCreate()方法
service.onCreate();
//拆除炸弹引线
ActivityManagerNative.getDefault().serviceDoneExecuting(
data.token, SERVICE_DONE_EXECUTING_ANON, 0, 0);
} catch (Exception e) {
...
}
}
在这个过程会创建目标服务对象,以及回调onCreate()方法, 紧接再次经过多次调用回到system_server来执行serviceDoneExecuting.
2.4.2 AS.serviceDoneExecutingLocked
private void serviceDoneExecutingLocked(ServiceRecord r, boolean inDestroying, boolean finishing) {
...
if (r.executeNesting <= 0) {
if (r.app != null) {
r.app.execServicesFg = false;
r.app.executingServices.remove(r);
if (r.app.executingServices.size() == 0) {
//当前服务所在进程中没有正在执行的service
mAm.mHandler.removeMessages(ActivityManagerService.SERVICE_TIMEOUT_MSG, r.app);
...
}
...
}
该方法的主要工作是当service启动完成,则移除服务超时消息SERVICE_TIMEOUT_MSG。
引爆炸弹
在system_server进程中有一个Handler线程, 名叫”ActivityManager”.当倒计时结束便会向该Handler线程发送 一条信息SERVICE_TIMEOUT_MSG。
2.5.1 MainHandler.handleMessage
final class MainHandler extends Handler {
public void handleMessage(Message msg) {
switch (msg.what) {
case SERVICE_TIMEOUT_MSG: {
...
mServices.serviceTimeout((ProcessRecord)msg.obj);
} break;
...
}
...
}
}
2.5.2 AS.serviceTimeout
void serviceTimeout(ProcessRecord proc) {
String anrMessage = null;
synchronized(mAm) {
if (proc.executingServices.size() == 0 || proc.thread == null) {
return;
}
final long now = SystemClock.uptimeMillis();
final long maxTime = now -
(proc.execServicesFg ? SERVICE_TIMEOUT : SERVICE_BACKGROUND_TIMEOUT);
ServiceRecord timeout = null;
long nextTime = 0;
for (int i=proc.executingServices.size()-1; i>=0; i--) {
ServiceRecord sr = proc.executingServices.valueAt(i);
if (sr.executingStart < maxTime) {
timeout = sr;
break;
}
if (sr.executingStart > nextTime) {
nextTime = sr.executingStart;
}
}
if (timeout != null && mAm.mLruProcesses.contains(proc)) {
Slog.w(TAG, "Timeout executing service: " + timeout);
StringWriter sw = new StringWriter();
PrintWriter pw = new FastPrintWriter(sw, false, 1024);
pw.println(timeout);
timeout.dump(pw, " ");
pw.close();
mLastAnrDump = sw.toString();
mAm.mHandler.removeCallbacks(mLastAnrDumpClearer);
mAm.mHandler.postDelayed(mLastAnrDumpClearer, LAST_ANR_LIFETIME_DURATION_MSECS);
anrMessage = "executing service " + timeout.shortName;
}
}
if (anrMessage != null) {
//当存在timeout的service,则执行appNotResponding
mAm.appNotResponding(proc, null, null, false, anrMessage);
}
}
当出现ANR时,都是调用到AMS.appNotResponding()方法。
Input类型ANR
当应用程序的窗口处于活动状态并且能够接收输入事件(例如按键事件、触摸事件等)时,系统底层上报的事件就会被 InputDispatcher 分发给该应用。开发者通常根据需要接收并处理不同的事件,或者创建自定义的视图控件来处理事件。
InputDispatcher 运行在 system_server 进程的一个子线程中,每当接收到一个新的输入事件,InputDispatcher 就会检测前一个已经发给应用程序的输入事件是否已经处理完毕(InBoundQueue, OutBoundQueue,WaitQueue),如果超时,会通过一系列的回调通知AMS报告ANR发生,需要注意的是,产生这种 ANR 的前提是要有输入事件,如果没有输入事件, 即使主线程阻塞了也不会报告 ANR。从设计的角度看,此时系统会推测用户没有继续点击该应用,寄希望于一段时间后阻塞会自行消失,因此会暂时“隐瞒不报”。
此类 ANR 发生时的关键字是:”Input event dispatching timed out sending to”,其中会有多种Reason,此类 ANR的超时时间默认为5秒,小内存设备通常会根据情况做适当的延长调整。
ANR处理流程
appNotResponding处理流程
final void appNotResponding(ProcessRecord app, ActivityRecord activity, ActivityRecord parent, boolean aboveSystem, final String annotation) {
...
updateCpuStatsNow(); //第一次 更新cpu统计信息,目前京东崩溃统计没有event.log信息
synchronized (this) {
//PowerManager.reboot() 会阻塞很长时间,因此忽略关机时的ANR
if (mShuttingDown) {
return;
} else if (app.notResponding) {
return;
} else if (app.crashing) {
return;
}
//记录ANR到EventLog
EventLog.writeEvent(EventLogTags.AM_ANR, app.userId, app.pid,
app.processName, app.info.flags, annotation);
// 将当前进程添加到firstPids
firstPids.add(app.pid);
int parentPid = app.pid;
//将system_server进程添加到firstPids
if (MY_PID != app.pid && MY_PID != parentPid) firstPids.add(MY_PID);
for (int i = mLruProcesses.size() - 1; i >= 0; i--) {
ProcessRecord r = mLruProcesses.get(i);
if (r != null && r.thread != null) {
int pid = r.pid;
if (pid > 0 && pid != app.pid && pid != parentPid && pid != MY_PID) {
if (r.persistent) {
firstPids.add(pid); //将persistent进程添加到firstPids
} else {
lastPids.put(pid, Boolean.TRUE); //其他进程添加到lastPids
}
}
}
}
}
// 记录ANR输出到main log
StringBuilder info = new StringBuilder();
info.setLength(0);
info.append("ANR in ").append(app.processName);
if (activity != null && activity.shortComponentName != null) {
info.append(" (").append(activity.shortComponentName).append(")");
}
info.append("\n");
info.append("PID: ").append(app.pid).append("\n");
if (annotation != null) {
info.append("Reason: ").append(annotation).append("\n");
}
if (parent != null && parent != activity) {
info.append("Parent: ").append(parent.shortComponentName).append("\n");
}
//创建CPU tracker对象
final ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(true);
//输出traces信息
File tracesFile = dumpStackTraces(true, firstPids, processCpuTracker,
lastPids, NATIVE_STACKS_OF_INTEREST);
updateCpuStatsNow(); //第二次更新cpu统计信息
//记录当前各个进程的CPU使用情况
synchronized (mProcessCpuTracker) {
cpuInfo = mProcessCpuTracker.printCurrentState(anrTime);
}
//记录当前CPU负载情况
info.append(processCpuTracker.printCurrentLoad());
info.append(cpuInfo);
//记录从anr时间开始的Cpu使用情况
info.append(processCpuTracker.printCurrentState(anrTime));
//输出当前ANR的reason,以及CPU使用率、负载信息
Slog.e(TAG, info.toString());
//将traces文件 和 CPU使用率信息保存到dropbox,即data/system/dropbox目录
addErrorToDropBox("anr", app, app.processName, activity, parent, annotation,
cpuInfo, tracesFile, null);
synchronized (this) {
...
//后台ANR的情况, 则直接杀掉
if (!showBackground && !app.isInterestingToUserLocked() && app.pid != MY_PID) {
app.kill("bg anr", true);
return;
}
//设置app的ANR状态,病查询错误报告receiver
makeAppNotRespondingLocked(app,
activity != null ? activity.shortComponentName : null,
annotation != null ? "ANR " + annotation : "ANR",
info.toString());
//重命名trace文件
String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
if (tracesPath != null && tracesPath.length() != 0) {
//traceRenameFile = "/data/anr/traces.txt"
File traceRenameFile = new File(tracesPath);
String newTracesPath;
int lpos = tracesPath.lastIndexOf (".");
if (-1 != lpos)
// 新的traces文件= /data/anr/traces_进程名_当前日期.txt
newTracesPath = tracesPath.substring (0, lpos) + "_" + app.processName + "_" + mTraceDateFormat.format(new Date()) + tracesPath.substring (lpos);
else
newTracesPath = tracesPath + "_" + app.processName;
traceRenameFile.renameTo(new File(newTracesPath));
}
//弹出ANR对话框
Message msg = Message.obtain();
HashMap<String, Object> map = new HashMap<String, Object>();
msg.what = SHOW_NOT_RESPONDING_MSG;
msg.obj = map;
msg.arg1 = aboveSystem ? 1 : 0;
map.put("app", app);
if (activity != null) {
map.put("activity", activity);
}
//向ui线程发送,内容为SHOW_NOT_RESPONDING_MSG的消息
mUiHandler.sendMessage(msg);
}
}
当发生ANR时, 会按顺序依次执行:
1.输出ANR Reason信息到EventLog. 也就是说ANR触发的时间点最接近的就是EventLog中输出的am_anr信息;
2.收集并输出重要进程列表中的各个线程的traces信息,该方法较耗时;
3.输出当前各个进程的CPU使用情况以及CPU负载情况;
4.将traces文件和 CPU使用情况信息保存到dropbox,即data/system/dropbox目录
5.根据进程类型,来决定直接后台杀掉,还是弹框告知用户.
ANR输出重要进程的traces信息,这些进程包含:
1.firstPids队列:第一个是ANR进程,第二个是system_server,剩余是所有persistent进程;
2.Native队列:是指/system/bin/目录的mediaserver,sdcard 以及surfaceflinger进程;
3.lastPids队列: 是指mLruProcesses中的不属于firstPids的所有进程。
三.ANR分析
指导思想:通过各种线索,寻求主线程在过去一段时间内发生阻塞的可能原因。
trace文件解读
----- pid 888 at 2021-11-11 22:22:22 -----
Cmd line: system_server
ABI: arm
Build type: optimized
Zygote loaded classes=4113 post zygote classes=3239
Intern table: 57550 strong; 9315 weak
//共加载16动态库
JNI: CheckJNI is off; globals=2418 (plus 115 weak)
Libraries: /system/lib/libandroid.so /system/lib/libandroid_servers.so /system/lib/libaudioeffect_jni.so /system/lib/libcompiler_rt.so /system/lib/libjavacrypto.so /system/lib/libjnigraphics.so /system/lib/libmedia_jni.so /system/lib/librs_jni.so /system/lib/libsechook.so /system/lib/libshell_jni.so /system/lib/libsoundpool.so /system/lib/libwebviewchromium_loader.so /system/lib/libwifi-service.so /vendor/lib/libalarmservice_jni.so /vendor/lib/liblocationservice.so libjavacore.so (16)
//已分配堆内存大小40MB,其中29M已用,总分配207772个对象
Heap: 27% free, 29MB/40MB; 307772 objects
... //省略GC相关信息
//当前进程总99个线程
DALVIK THREADS (99):
//主线程调用栈
"main" prio=5 tid=1 Native
| group="main" sCount=1 dsCount=0 obj=0x75bd9fb0 self=0x5573d4f770
| sysTid=12078 nice=-2 cgrp=default sched=0/0 handle=0x7fa75fafe8
| state=S schedstat=( 5907843636 827600677 5112 ) utm=453 stm=137 core=0 HZ=100
| stack=0x7fd64ef000-0x7fd64f1000 stackSize=8MB
| held mutexes=
//内核栈
kernel: __switch_to+0x70/0x7c
kernel: SyS_epoll_wait+0x2a0/0x324
kernel: SyS_epoll_pwait+0xa4/0x120
kernel: cpu_switch_to+0x48/0x4c
native: #00 pc 0000000000069be4 /system/lib64/libc.so (__epoll_pwait+8)
native: #01 pc 000000000001cca4 /system/lib64/libc.so (epoll_pwait+32)
native: #02 pc 000000000001ad74 /system/lib64/libutils.so (_ZN7android6Looper9pollInnerEi+144)
native: #03 pc 000000000001b154 /system/lib64/libutils.so (_ZN7android6Looper8pollOnceEiPiS1_PPv+80)
native: #04 pc 00000000000d4bc0 /system/lib64/libandroid_runtime.so (_ZN7android18NativeMessageQueue8pollOnceEP7_JNIEnvP8_jobjecti+48)
native: #05 pc 000000000000082c /data/dalvik-cache/arm64/system@framework@boot.oat (Java_android_os_MessageQueue_nativePollOnce__JI+144)
at android.os.MessageQueue.nativePollOnce(Native method)
at android.os.MessageQueue.next(MessageQueue.java:323)
at android.os.Looper.loop(Looper.java:135)
at com.android.server.SystemServer.run(SystemServer.java:290)
at com.android.server.SystemServer.main(SystemServer.java:175)
at java.lang.reflect.Method.invoke!(Native method)
at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:738)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:628)
//线程名: Binder_1(如有daemon则代表守护线程);prio: 线程优先级;tid: 线程内部id;线程状态: NATIVE
"Binder_1" prio=5 tid=8 Native
//group: 线程所属的线程组;sCount: 线程挂起次数;dsCount: 用于调试的线程挂起次数;obj: 当前线程关联的java线程对象;self: 当前线程地址
| group="main" sCount=1 dsCount=0 obj=0x12c610a0 self=0x5573e5c750
| sysTid=12092 nice=0 cgrp=default sched=0/0 handle=0x7fa2743450
//schedstat: CPU调度时间统计
| state=S schedstat=( 796240075 863170759 3586 ) utm=50 stm=29 core=1 HZ=100
| stack=0x7fa2647000-0x7fa2649000 stackSize=1013KB
| held mutexes=
... //此处省略剩余的N个线程.
主要信息说明:
main:main标识是主线程。其他线程“Binder_1”,1表示线程id,逐步递增
prio:线程优先级,默认是5
tid:tid不是线程的id,是线程唯一标识ID
group:是线程组名称
sCount:该线程被挂起的次数
dsCount:是线程被调试器挂起的次数
obj:对象地址
self:该线程Native的地址
sysTid:是线程号(主线程的线程号和进程号相同)
nice:是线程的调度优先级
sched:分别标志了线程的调度策略和优先级
cgrp:调度归属组
handle:线程处理函数的地址
state:是调度状态
schedstat:从 /proc/[pid]/task/[tid]/schedstat读出,三个值分别表示线程在cpu上执行的时间、线程的等待时间和线程执行的时间片长度,不支持这项信息的三个值都是0;
utm:是线程用户态下使用的时间值(单位是jiffies)
stm:是内核态下的调度时间值
core:是最后执行这个线程的cpu核的序号。
分析流程
除了trace日志,我们还有很多其他日志信息可以用来帮助分析。
1.要确定ANR的种类。
2.查看CPU的占用,看IOWait的占比是否很高,判断是否在等待IO。
3.找到anr的第一时间点,对于Input类型ANR来说,可以看到“Input event dispatching timed out sending to ”,这里是 ANR被触发的第一时间点。在明确了这个时间点后,再根据不同ANR的类型分析。如广播超时需要分析第一时间点前10秒(60秒)的广播队列信息。
4.检查主线程是否存在死锁、阻塞、主线程耗时操作等问题,结合源码分析排查问题。
分析 ANR 问题时通常需要通盘考虑,综合log中各方面的信息。切忌只看问题的一个侧面就轻率作出结论。而且traces.txt的可用性并不是很高,可能反应的问题并不是根本原因。确实有一些ANR问题很难调查清楚,因为整个系统不稳定的因素很多。
实例分析
实例1:
"main" prio=5 tid=1 Blocked
| group="main" sCount=1 dsCount=0 flags=1 obj=0x72e03498 self=0x792f010800
| sysTid=8250 nice=-4 cgrp=default sched=0/0 handle=0x79b5f18ed0
| state=S schedstat=( 933532789 105856790 3028 ) utm=62 stm=30 core=6 HZ=100
| stack=0x7ff5db9000-0x7ff5dbb000 stackSize=8192KB
| held mutexes=
at com.tencent.map.geolocation.TencentLocationManager.getInstance(unavailable:-1)
- waiting to lock <0x0d50926a> (a java.lang.Class<com.tencent.map.geolocation.TencentLocationManager>) held by thread 66
at com.jingdong.common.lbs.jdlocation.JDLocationTencentSDK.getInstance(unavailable:-1)
- locked <0x0fe6d35b> (a java.lang.Class<com.jingdong.common.lbs.jdlocation.JDLocationTencentSDK>)
at com.jingdong.common.lbs.jdlocation.JDLocationSDK$14.run(unavailable:-1)
at android.os.Handler.handleCallback(Handler.java:900)
at android.os.Handler.dispatchMessage(Handler.java:103)
at android.os.Looper.loop(Looper.java:219)
at android.app.ActivityThread.main(ActivityThread.java:8347)
at java.lang.reflect.Method.invoke(Native method)
at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:513)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1055)
//省略
"pool-6-thread-2" prio=5 tid=66 Native
| group="main" sCount=1 dsCount=0 flags=1 obj=0x1898b040 self=0x78b5bf9800
| sysTid=8584 nice=0 cgrp=default sched=0/0 handle=0x788faf2d50
| state=S schedstat=( 22107813 25765621 59 ) utm=0 stm=1 core=4 HZ=100
| stack=0x788f9f0000-0x788f9f2000 stackSize=1039KB
| held mutexes=
kernel: (couldn't read /proc/self/task/8584/stack)
native: #00 pc 00000000000baf14 /apex/com.android.runtime/lib64/bionic/libc.so (wait4+4)
native: #01 pc 00000000001faaa4 /apex/com.android.runtime/lib64/libart.so (art::ExecAndReturnCode(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*)+760)
native: #02 pc 00000000001fae58 /apex/com.android.runtime/lib64/libart.so (art::Exec(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*)+40)
native: #03 pc 000000000046b1b0 /apex/com.android.runtime/lib64/libart.so (art::OatFileAssistant::Dex2Oat(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*)+2528)
native: #04 pc 0000000000467fec /apex/com.android.runtime/lib64/libart.so (art::OatFileAssistant::GenerateOatFileNoChecks(art::OatFileAssistant::OatFileInfo&, art::CompilerFilter::Filter, art::ClassLoaderContext const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*)+3496)
native: #05 pc 0000000000466ed0 /apex/com.android.runtime/lib64/libart.so (art::OatFileAssistant::MakeUpToDate(bool, art::ClassLoaderContext*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*)+416)
native: #06 pc 000000000046eb68 /apex/com.android.runtime/lib64/libart.so (art::OatFileManager::OpenDexFilesFromOat(char const*, _jobject*, _jobjectArray*, art::OatFile const**, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>*)+340)
native: #07 pc 00000000004208d4 /apex/com.android.runtime/lib64/libart.so (art::DexFile_openDexFileNative(_JNIEnv*, _jclass*, _jstring*, _jstring*, int, _jobject*, _jobjectArray*)+140)
at dalvik.system.DexFile.openDexFileNative(Native method)
at dalvik.system.DexFile.openDexFile(DexFile.java:365)
at dalvik.system.DexFile.<init>(DexFile.java:107)
at dalvik.system.DexFile.<init>(DexFile.java:80)
at dalvik.system.DexPathList.loadDexFile(DexPathList.java:444)
at dalvik.system.DexPathList.makeDexElements(DexPathList.java:393)
at dalvik.system.DexPathList.<init>(DexPathList.java:164)
at dalvik.system.BaseDexClassLoader.<init>(BaseDexClassLoader.java:131)
at dalvik.system.BaseDexClassLoader.<init>(BaseDexClassLoader.java:89)
at dalvik.system.DexClassLoader.<init>(DexClassLoader.java:55)
at com.tencent.map.geolocation.a.a.b.a(unavailable:-1)
- locked <0x072360c5> (a com.tencent.map.geolocation.a.a.b)
at com.tencent.map.geolocation.a.a.b.a(unavailable:-1)
at com.tencent.map.geolocation.a.a.f.a(unavailable:-1)
- locked <0x0eea6a1a> (a com.tencent.map.geolocation.a.a.f)
at com.tencent.map.geolocation.TencentLocationManager.<init>(unavailable:-1)
at com.tencent.map.geolocation.TencentLocationManager.getInstance(unavailable:-1)
- locked <0x0d50926a> (a java.lang.Class<com.tencent.map.geolocation.TencentLocationManager>)
at com.jingdong.common.lbs.jdlocation.JDLocationSDK$1.run(unavailable:-1)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
at java.lang.Thread.run(Thread.java:929)
找到main线程,发现waiting to lock <0x0d50926a> (a java.lang.Class<com.tencent.map.geolocation.TencentLocationManager>) held by thread 66,main 线程正在尝试获取线程66的0x0d50926a lock,全局索索0x0d50926a找到对应的线程66。发现线程66 lock了0x0d50926a并开始openDexFileNative,现在需要翻看源码。
public static synchronized TencentLocationManager getInstance(Context var0) {
if (e == null) {
if (var0 == null) {
throw new NullPointerException("context is null");
}
if (var0.getApplicationContext() == null) {
throw new NullPointerException("application context is null");
}
e = new TencentLocationManager(var0.getApplicationContext());
}
return e;
}
synchronized修饰单例中,子线程进行初始化,主线程获取实例时需等子线程完成,子线程有操作耗时,造成主线程卡顿。此类问题可以考虑是否可以放到子线程中操作,类似问题有readDeviceUUID。
实例2:
"main" prio=5 tid=1 Native
| group="main" sCount=1 dsCount=0 obj=0x73ee6470 self=0xb4d76500
| sysTid=22831 nice=0 cgrp=default sched=0/0 handle=0xb6f4bc00
| state=S schedstat=( 0 0 0 ) utm=22 stm=22 core=0 HZ=100
| stack=0xbe283000-0xbe285000 stackSize=8MB
| held mutexes=
native: #00 pc 000410ac /system/lib/libc.so (__ioctl+8)
native: #01 pc 000477e5 /system/lib/libc.so (ioctl+14)
native: #02 pc 0001e7c5 /system/lib/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+132)
native: #03 pc 0001ee17 /system/lib/libbinder.so (android::IPCThreadState::waitForResponse(android::Parcel*, int*)+38)
native: #04 pc 0001efcd /system/lib/libbinder.so (android::IPCThreadState::transact(int, unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+124)
native: #05 pc 00019fb7 /system/lib/libbinder.so (android::BpBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+30)
native: #06 pc 00086de9 /system/lib/libandroid_runtime.so (???)
native: #07 pc 00d94629 /data/dalvik-cache/arm/system@framework@boot.oat (Java_android_os_BinderProxy_transactNative__ILandroid_os_Parcel_2Landroid_os_Parcel_2I+140)
at android.os.BinderProxy.transactNative(Native method)
at android.os.BinderProxy.transact(Binder.java:503)
at android.net.INetworkPolicyManager$Stub$Proxy.getNetworkPolicies(INetworkPolicyManager.java:410)
at android.net.NetworkPolicyManager.getNetworkPolicies(NetworkPolicyManager.java:174)
at com.android.settings.net.NetworkPolicyEditor.read(NetworkPolicyEditor.java:57)
at com.android.settings.DataUsageSummary.onCreate(DataUsageSummary.java:361)
at android.app.Fragment.performCreate(Fragment.java:2202)
at android.app.FragmentManagerImpl.moveToState(FragmentManager.java:942)
at android.app.FragmentManagerImpl.moveToState(FragmentManager.java:1148)
at android.app.BackStackRecord.run(BackStackRecord.java:793)
at android.app.FragmentManagerImpl.execPendingActions(FragmentManager.java:1535)
at android.app.FragmentManagerImpl.executePendingTransactions(FragmentManager.java:562)
at com.android.settings.SettingsActivity.switchToFragment(SettingsActivity.java:1084)
at com.android.settings.SettingsActivity.onCreate(SettingsActivity.java:657)
at android.app.Activity.performCreate(Activity.java:6251)
at android.app.Instrumentation.callActivityOnCreate(Instrumentation.java:1110)
at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:2370)
at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:2477)
at android.app.ActivityThread.-wrap11(ActivityThread.java:-1)
at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1345)
at android.os.Handler.dispatchMessage(Handler.java:102)
at android.os.Looper.loop(Looper.java:148)
at android.app.ActivityThread.main(ActivityThread.java:5438)
at java.lang.reflect.Method.invoke!(Native method)
at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:762)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:652)
主线程被对端block,而对端是在systemserver中的NetworkPolicyManager。
接下来继续去查NetworkPolicyManager为何会被block。
通常情况下,可以搜索getNetworkPolicies,一般的对端的函数并不会修改函数名:
看到我们的对端:
"Binder_4" prio=5 tid=56 Blocked
| group="main" sCount=1 dsCount=0 obj=0x1321a0a0 self=0xad31e200
| sysTid=2491 nice=0 cgrp=default sched=0/0 handle=0x9fd00930
| state=S schedstat=( 0 0 0 ) utm=46441 stm=46582 core=1 HZ=100
| stack=0x9fc04000-0x9fc06000 stackSize=1014KB
| held mutexes=
at com.android.server.net.NetworkPolicyManagerService.getNetworkPolicies(NetworkPolicyManagerService.java:1696)
- waiting to lock <0x07439315> (a java.lang.Object) held by thread 35
at android.net.INetworkPolicyManager$Stub.onTransact(INetworkPolicyManager.java:145)
at android.os.Binder.execTransact(Binder.java:453
被Tid=35的人拿住了一把锁(0x07439315),那么继续看tid=35是谁,有两种方法:
1. 搜索tid=35
2. 搜索0x07439315,找到 - locked <0x07439315> (a java.lang.Object)
"NetworkPolicy" prio=5 tid=35 TimedWaiting
| group="main" sCount=1 dsCount=0 obj=0x12d98940 self=0x9f91f700
| sysTid=2415 nice=0 cgrp=default sched=0/0 handle=0xa0f33930
| state=S schedstat=( 0 0 0 ) utm=7681 stm=7783 core=0 HZ=100
| stack=0xa0e31000-0xa0e33000 stackSize=1038KB
| held mutexes=
at java.lang.Object.wait!(Native method)
- waiting on <0x02580c1b> (a java.lang.Object)
at java.lang.Thread.parkFor$(Thread.java:1220)
- locked <0x02580c1b> (a java.lang.Object)
at sun.misc.Unsafe.park(Unsafe.java:299)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2053)
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:372)
at com.android.server.NativeDaemonConnector$ResponseQueue.remove(NativeDaemonConnector.java:634)
at com.android.server.NativeDaemonConnector.executeForList(NativeDaemonConnector.java:426)
at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:345)
at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:340)
at com.android.server.NetworkManagementService.setInterfaceQuota(NetworkManagementService.java:1712)
- locked <0x0b0f91b8> (a java.lang.Object)
at com.android.server.net.NetworkPolicyManagerService.setInterfaceQuota(NetworkPolicyManagerService.java:2421)
at com.android.server.net.NetworkPolicyManagerService.updateNetworkRulesLocked(NetworkPolicyManagerService.java:1232)
at com.android.server.net.NetworkPolicyManagerService$14.onReceive(NetworkPolicyManagerService.java:1060)
- locked <0x07439315> (a java.lang.Object)
at android.app.LoadedApk$ReceiverDispatcher$Args.run(LoadedApk.java:881)
at android.os.Handler.handleCallback(Handler.java:739)
at android.os.Handler.dispatchMessage(Handler.java:95)
at android.os.Looper.loop(Looper.java:148)
at android.os.HandlerThread.run(HandlerThread.java:61)
可以看到,NetworkPolicy在通过NativeDaemonConnector和netd通信(setInterfaceQuota)
我们结合log来看下是否有有用信息,在发生ANR的前后,查找netd相关的打印:
06-19 15:29:00.997 1235 1270 I am_anr : [0,22831,com.android.settings,818429509,Input dispatching timed out (Waiting because no window has focus but there is a focused application that may eventually add a window when it finishes starting up.)]
06-19 15:29:05.683 1235 2415 E NetdConnector: NDC Command {55445 bandwidth setiquota seth_w0 9223372036854775807} took too long (4755ms
06-19 15:29:05.723 1235 2491 I dvm_lock_sample: [system_server,1,Binder_4,4919,NetworkPolicyManagerService.java,1696,-,1056,100]
排查日志时,定位到问题点后,需根据anr类型向前排查,排查多长时间需根据anr类型确定,比如前台广播的anr需要向前找10s,查看当时处理的消息内容。
我们可以猜测ANR的原因是netd的cmd耗时太久导致的,那么在netd可能无法修改的情况下,我们应该如何去解决这个问题呢,将可能存在block的操作放到非UI线程中去做。
四.如何避免ANR
1.合理使用主线程,主线程又叫UI线程,UI线程只做跟UI相关的工作。耗时的工作(比如数据库操作,I/O,连接网络或者别的有可能阻碍UI线程的block)把它放入单独的线程处理。
2.合理使用并遵循Android生命周期,避免在onCreate()和onResume()等做过多耗时的事情。
3.sharedPreference的使用:sharedPreference的commit()方法是同步的,apply()方法一般是异步执行的。在主线程尽量使用apply()替换。sharedPreference是全量写,所以尽量统一执行apply()(apply()方法在Activity stop的时候主线程会等待写入完成),sp存储内容不宜过大。
4.如果I/O阻塞,一般来说就是文件读写或数据库操作执行在主线程了,可以通过开辟子线程的方式异步执行。
5.减少复杂的layout,避免布局深层嵌套,使用merge和include减少层级,使用ViewStub懒加载减少不必要渲染。可使用Hierarchy Viewer工具检测View性能。
6.避免频繁过多的创建线程或者其它大对象。
7.避免加载过大数据和图片。对于图片展示较多的情况,应当考虑加载缩略图等方案。
8.避免滥用广播,如果处理一个广播需要花费较长时间,应当通过Service来完成。