简介
原文请看我的博客:从源码角度看traces.txt是如何生成的
traces.txt 位于安卓系统下/data/anr目录下,当系统中有应用出现ANR时,framework会在弹出Dialog的同时dump出当前各线程的堆栈情况,方便开发者分析出ANR的root cause。
ANR是Application Not Responsing 的简称,简而言之,就是安卓系统内置提示用户应用界面没有反应的机制,是用来避免应用界面一直卡顿,增加系统用户友好度的一种方式。
造成ANR的原因很多,都是因为在主线程执行任务太久阻塞了界面更新导致的,主要有以下几类:
- Broadcast Timeout: 前台广播执行超过10s, 后台广播执行超过60s (要注意的是,只有串行的广播才有超时机制,并行的广播并不会超时,也就是说,如果广播是动态注册的,直接调用sendBroadcast触发,如果主线程Looper中排在后面的Message不会触发超时机制,那么即时这个广播是前台广播,系统也永远不会弹出框提示用户超时了)
- Service Timeout: 前台服务之星超过20s, 后台服务之星超过200s
- Provider Timeout: 内容提供者,publish超过10s
- Input Timeout: 按键触摸事件分发超过5s
能够造成ANR的前提是任务是在主线程上执行的,执行什么样的任务主要有以下几点:
- 执行耗时任务过久,如文件读取或存储,网络访问获取文件太耗时
- 线程被阻塞过久,或者干脆出现了死锁
- 线程饥饿,如Binder线程总共16个,Binder主线程占有一个,剩余的15个工作线程都被占满
- CPU饥饿,负载值过大,虽然代码正常额但任务一直没有来得及执行
那么回到traces.txt文件,它到底包含了什么信息可以帮助开发者找到ANR问题的根源呢。
在这篇文章,我先从一份traces.txt的日志实例开始解析,然后再通过追踪源码来解释traces.txt是如何生成的。
我是基于Android 5.0原生代码进行分析的,最新的一些代码逻辑放到了AppErrors类中了,有些小差别,但是大体的逻辑和步骤没有变化,不会影响阅读。
日志实例
// 触发ANR的应用pid及触发时间
----- pid 30307 at 2017-09-03 14:51:14 -----
Cmd line: com.example.androidyue.bitmapdemo
JNI: CheckJNI is off; workarounds are off; pins=0; globals=272
DALVIK THREADS:
(mutexes: tll=0 tsl=0 tscl=0 ghl=0)
// JAVA 态线程状态
// 线程名称; 优先级; 线程id; 线程状态
"main" prio=5 tid=1 TIMED_WAIT
// 线程组; 线程suspend计数; 线程debug suspend计数; 同辈线程对象; 当前线程对象
| group="main" sCount=1 dsCount=0 obj=0x416eaf18 self=0x416d8650
// 线程id; 进程优先级; 调度者优先级; 调度者组名; 线程缓存句柄
| sysTid=30307 nice=0 sched=0/0 cgrp=apps handle=1074565528
// native线程状态; 调度者状态; 执行用户代码的总时间; 执行系统代码的总时间; 任务cpu使用数量
| state=S schedstat=( 0 0 0 ) utm=5 stm=4 core=3
// 线程调用栈
at java.lang.VMThread.sleep(Native Method)
at java.lang.Thread.sleep(Thread.java:1044)
at java.lang.Thread.sleep(Thread.java:1026)
at com.example.androidyue.bitmapdemo.MainActivity$1.run(MainActivity.java:27)
at android.app.Activity.runOnUiThread(Activity.java:4794)
at com.example.androidyue.bitmapdemo.MainActivity.onResume(MainActivity.java:33)
at android.app.Instrumentation.callActivityOnResume(Instrumentation.java:1282)
at android.app.Activity.performResume(Activity.java:5405)
// Native态线程状态
"Binder_5" prio=5 tid=32 NATIVE
| group="main" sCount=1 dsCount=0 obj=0x41dd6570 self=0x678fa458
| sysTid=29473 nice=0 sched=0/0 cgrp=apps handle=1732666432
| state=S schedstat=( 578000 10600000 3 ) utm=0 stm=0 core=0
#00 pc 00016fe4 /system/lib/libc.so (__ioctl+8)
#01 pc 0002a97d /system/lib/libc.so (ioctl+16)
#02 pc 00016ba1 /system/lib/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+132)
#03 pc 00017363 /system/lib/libbinder.so (android::IPCThreadState::joinThreadPool(bool)+154)
#04 pc 0001b15d /system/lib/libbinder.so
#05 pc 00011267 /system/lib/libutils.so (android::Thread::_threadLoop(void*)+114)
#06 pc 0004679f /system/lib/libandroid_runtime.so (android::AndroidRuntime::javaThreadShell(void*)+66)
#07 pc 00010dcd /system/lib/libutils.so
#08 pc 0000e3d8 /system/lib/libc.so (__thread_entry+72)
#09 pc 0000dac4 /system/lib/libc.so (pthread_create+160)
at dalvik.system.NativeStart.run(Native Method)复制代码
关键词对照
关键词 | 解释 | |
---|---|---|
prio | 优先级 | |
tid | 线程id | |
group | 组名 | |
sCount | 挂起次数 | |
dsCount | 调试下 挂起计数 | |
obj | 同辈线程对象 | |
self | 当前线程对象 | |
sysTid | 线程id | |
nice | 进程优先级 | |
sched | 调度者优先级 | |
cgrp | 调度者组名 | |
handle | 线程缓存句柄 | |
state | native线程状态 | |
schedstat | 调度者状态 | |
utm | 用户态CPU时间 | |
stm | 内核态CPU时间 | |
core | 运行所在核 |
线程状态对照
JAVA状态 | CPP状态 | 解释 |
---|---|---|
TERMINATED | ZOMBIE | 线程死亡 |
RUNNABLE | RUNNING/RUNNABLE | 可运行或正在运行 |
TIMED_WAITING | TIMED_WAIT | 执行了带有超时参数的wait, sleep或join函数 |
BLOCKED | MONITOR | 线程阻塞,等待获取对象锁 |
WAITING | WAIT | 执行了无超时参数的wait |
NEW | INITIALIZING | 新建,正在初始化中 |
NEW | STARTING | 新建,正在启动 |
RUNNABLE | NATIVE | 正在执行JNI本地函数 |
WAITING | VMWAIT | 正在等待VM资源 |
RUNNABLE | SUSPENDED | 线程暂停,通常由于GC或者DEBUG被挂起 |
源码分析 (虚拟机)
我将从代码的最底层开始分析,再慢慢到触发点的顶端,先从cpp开始
Thread::DumpState
thread.cc
void Thread::DumpState(std::ostream& os, const Thread* thread, pid_t tid) {
// 组名
std::string group_name;
// 优先级
int priority;
// 是否是daemon线程
bool is_daemon = false;
// 拿到当前线程对象指针
Thread* self = Thread::Current();
if (thread != nullptr) {
...
}
if (gAborting == 0 && self != nullptr && thread != nullptr && thread->tlsPtr_.opeer != nullptr) {
ScopedObjectAccessUnchecked soa(self);
priority = soa.DecodeField(WellKnownClasses::java_lang_Thread_priority)
->GetInt(thread->tlsPtr_.opeer);
is_daemon = soa.DecodeField(WellKnownClasses::java_lang_Thread_daemon)
->GetBoolean(thread->tlsPtr_.opeer);
...
if (thread_group != nullptr) {
...
group_name = (group_name_string != nullptr) ? group_name_string->ToModifiedUtf8() : "<null>";
}
} else {
priority = GetNativePriority();
}
std::string scheduler_group_name(GetSchedulerGroupName(tid));
if (scheduler_group_name.empty()) {
scheduler_group_name = "default";
}
// 这里开始输出线程状态
if (thread != nullptr) {
// 线程名输出
os << '"' << *thread->tlsPtr_.name << '"';
if (is_daemon) {
os << " daemon";
}
// 优先级; ID; 状态
os << " prio=" << priority
<< " tid=" << thread->GetThreadId()
<< " " << thread->GetState();
if (thread->IsStillStarting()) {
os << " (still starting up)";
}
os << "\n";
} else {
os << '"' << ::art::GetThreadName(tid) << '"'
<< " prio=" << priority
<< " (not attached)\n";
}
// 线程组; 线程suspend计数; 线程debug suspend计数; 同辈线程对象; 当前线程对象
if (thread != nullptr) {
MutexLock mu(self, *Locks::thread_suspend_count_lock_);
os << " | group=\"" << group_name << "\""
<< " sCount=" << thread->tls32_.suspend_count
<< " dsCount=" << thread->tls32_.debug_suspend_count
<< " obj=" << reinterpret_cast<void*>(thread->tlsPtr_.opeer)
<< " self=" << reinterpret_cast<const void*>(thread) << "\n";
}
// 线程id; 进程优先级; 调度者组名
os << " | sysTid=" << tid
<< " nice=" << getpriority(PRIO_PROCESS, tid)
<< " cgrp=" << scheduler_group_name;
// 调度者优先级; 缓存句柄
if (thread != nullptr) {
int policy;
sched_param sp;
CHECK_PTHREAD_CALL(pthread_getschedparam, (thread->tlsPtr_.pthread_self, &policy, &sp),
__FUNCTION__);
os << " sched=" << policy << "/" << sp.sched_priority
<< " handle=" << reinterpret_cast<void*>(thread->tlsPtr_.pthread_self);
}
os << "\n";
// Grab the scheduler stats for this thread.
std::string scheduler_stats;
if (ReadFileToString(StringPrintf("/proc/self/task/%d/schedstat", tid), &scheduler_stats)) {
scheduler_stats.resize(scheduler_stats.size() - 1); // Lose the trailing '\n'.
} else {
scheduler_stats = "0 0 0";
}
char native_thread_state = '?';
int utime = 0;
int stime = 0;
int task_cpu = 0;
GetTaskStats(tid, &native_thread_state, &utime, &stime, &task_cpu);
// native线程状态; 调度者状态; 执行用户代码的总时间; 执行系统代码的总时间; 任务cpu使用数量
os << " | state=" << native_thread_state
<< " schedstat=( " << scheduler_stats << " )"
<< " utm=" << utime
<< " stm=" << stime
<< " core=" << task_cpu
<< " HZ=" << sysconf(_SC_CLK_TCK) << "\n";
if (thread != nullptr) {
// dump 线程当前栈
os << " | stack=" << reinterpret_cast<void*>(thread->tlsPtr_.stack_begin) << "-"
<< reinterpret_cast<void*>(thread->tlsPtr_.stack_end) << " stackSize="
<< PrettySize(thread->tlsPtr_.stack_size) << "\n";
// dump 线程锁信息
os << " | held mutexes=";
for (size_t i = 0; i < kLockLevelCount; ++i) {
if (i != kMonitorLock) {
BaseMutex* mutex = thread->GetHeldMutex(static_cast<LockLevel>(i));
if (mutex != nullptr) {
os << " \"" << mutex->GetName() << "\"";
if (mutex->IsReaderWriterMutex()) {
ReaderWriterMutex* rw_mutex = down_cast<ReaderWriterMutex*>(mutex);
if (rw_mutex->GetExclusiveOwnerTid() == static_cast<uint64_t>(tid)) {
os << "(exclusive held)";
} else {
os << "(shared held)";
}
}
}
}
}
os << "\n";
}
}复制代码
GetTaskStats
utils.cc
void GetTaskStats(pid_t tid, char* state, int* utime, int* stime, int* task_cpu) {
*utime = *stime = *task_cpu = 0;
std::string stats;
// 从stat节点获取数据
if (!ReadFileToString(StringPrintf("/proc/self/task/%d/stat", tid), &stats)) {
return;
}
// Skip the command, which may contain spaces.
stats = stats.substr(stats.find(')') + 2);
// Extract the three fields we care about.
std::vector<std::string> fields;
Split(stats, ' ', &fields);
*state = fields[0][0];
*utime = strtoull(fields[11].c_str(), nullptr, 10);
*stime = strtoull(fields[12].c_str(), nullptr, 10);
*task_cpu = strtoull(fields[36].c_str(), nullptr, 10);
}复制代码
DumpUnattachedThread
thread_list.cc
// 读取该线程中的所有状态信息:JAVA态,Native态,Kernel态
static void DumpUnattachedThread(std::ostream& os, pid_t tid) NO_THREAD_SAFETY_ANALYSIS {
// [接上]
Thread::DumpState(os, nullptr, tid);
DumpKernelStack(os, tid, " kernel: ", false);
if (false) {
DumpNativeStack(os, tid, " native: ");
}
os << "\n";
}
void ThreadList::DumpUnattachedThreads(std::ostream& os) {
DIR* d = opendir("/proc/self/task");
if (!d) {
return;
}
Thread* self = Thread::Current();
dirent* e;
// 获取当前进程中所有线程,dump这些线程的信息
while ((e = readdir(d)) != nullptr) {
char* end;
pid_t tid = strtol(e->d_name, &end, 10);
if (!*end) {
bool contains;
{
MutexLock mu(self, *Locks::thread_list_lock_);
contains = Contains(tid);
}
if (!contains) {
// [接上]
DumpUnattachedThread(os, tid);
}
}
}
closedir(d);
}
bool ThreadList::Contains(Thread* thread) {
return find(list_.begin(), list_.end(), thread) != list_.end();
}复制代码
DumpForSigQuit
thread_list.cc
void ThreadList::DumpForSigQuit(std::ostream& os) {
...
DumpUnattachedThreads(os);
}复制代码
runtime.cc
void Runtime::DumpForSigQuit(std::ostream& os) {
GetClassLinker()->DumpForSigQuit(os);
GetInternTable()->DumpForSigQuit(os);
GetJavaVM()->DumpForSigQuit(os);
GetHeap()->DumpForSigQuit(os);
TrackedAllocators::Dump(os);
os << "\n";
// [接上]
// dump 线程列表
thread_list_->DumpForSigQuit(os);
BaseMutex::DumpAll(os);
}复制代码
HandleSigQuit
singal_catcher.cc
void SignalCatcher::HandleSigQuit() {
Runtime* runtime = Runtime::Current();
std::ostringstream os;
// 输出进程id以及当前dump时间
os << "\n"
<< "----- pid " << getpid() << " at " << GetIsoDate() << " -----\n";
DumpCmdLine(os);
// 输出系统信息
std::string fingerprint = runtime->GetFingerprint();
os << "Build fingerprint: '" << (fingerprint.empty() ? "unknown" : fingerprint) << "'\n";
os << "ABI: '" << GetInstructionSetString(runtime->GetInstructionSet()) << "'\n";
os << "Build type: " << (kIsDebugBuild ? "debug" : "optimized") << "\n";
// [接上]
runtime->DumpForSigQuit(os);
...
os << "----- end " << getpid() << " -----\n";
Output(os.str());
}复制代码
SignalCatcher::Run
void* SignalCatcher::Run(void* arg) {
...
// Set up mask with signals we want to handle.
SignalSet signals;
signals.Add(SIGQUIT);
signals.Add(SIGUSR1);
while (true) {
// 等待信号
int signal_number = signal_catcher->WaitForSignal(self, signals);
...
switch (signal_number) {
case SIGQUIT:
// [接上]
signal_catcher->HandleSigQuit();
break;
case SIGUSR1:
signal_catcher->HandleSigUsr1();
break;
default:
LOG(ERROR) << "Unexpected signal %d" << signal_number;
break;
}
}
}
int SignalCatcher::WaitForSignal(Thread* self, SignalSet& signals) {
...
int signal_number = signals.Wait();
...
return signal_number;
}复制代码
源码分析 (framework)
可见,android framework通过发送信号3来通知虚拟机来进行dump到traces.txt文件中的
dumpStackTraces
ActivityManagerService.java
private static void dumpStackTraces(String tracesPath, ArrayList<Integer> firstPids,
ProcessCpuTracker processCpuTracker, SparseArray<Boolean> lastPids, String[] nativeProcs) {
// 当trace写完后,监听并触发onEvent方法
FileObserver observer = new FileObserver(tracesPath, FileObserver.CLOSE_WRITE) {
@Override
public synchronized void onEvent(int event, String path) {
notify();
}
};
try {
// 开始监听
observer.startWatching();
// 首先收集firstPids链表里的进程信息
if (firstPids != null) {
try {
int num = firstPids.size();
for (int i = 0; i < num; i++) {
synchronized (observer) {
// 发送信号3来触发
Process.sendSignal(firstPids.get(i), Process.SIGNAL_QUIT);
observer.wait(200); // Wait for write-close, give up after 200msec
}
}
} catch (InterruptedException e) {
Slog.wtf(TAG, e);
}
}
// 收集native进程信息
...
// 最后收集CPU信息
...
} finally {
observer.stopWatching();
}
}复制代码
public static File dumpStackTraces(boolean clearTraces, ArrayList<Integer> firstPids,
ProcessCpuTracker processCpuTracker, SparseArray<Boolean> lastPids, String[] nativeProcs) {
// 获取trace文件路径
String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
...
File tracesFile = new File(tracesPath);
try {
...
FileUtils.setPermissions(tracesDir.getPath(), 0775, -1, -1); // drwxrwxr-x
if (clearTraces && tracesFile.exists()) tracesFile.delete();
tracesFile.createNewFile();
FileUtils.setPermissions(tracesFile.getPath(), 0666, -1, -1); // -rw-rw-rw-
} catch (IOException e) {
Slog.w(TAG, "Unable to prepare ANR traces file: " + tracesPath, e);
return null;
}
// [接上]
dumpStackTraces(tracesPath, firstPids, processCpuTracker, lastPids, nativeProcs);
return tracesFile;
}复制代码
appNotResponding
ActivityManagerService.java
final void appNotResponding(ProcessRecord app, ActivityRecord activity,
ActivityRecord parent, boolean aboveSystem, final String annotation) {
ArrayList<Integer> firstPids = new ArrayList<Integer>(5);
SparseArray<Boolean> lastPids = new SparseArray<Boolean>(20);
...
final ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(true);
// [接上]
File tracesFile = dumpStackTraces(true, firstPids, processCpuTracker, lastPids,
NATIVE_STACKS_OF_INTEREST);
...
Slog.e(TAG, info.toString());
if (tracesFile == null) {
// There is no trace file, so dump (only) the alleged culprit's threads to the log
Process.sendSignal(app.pid, Process.SIGNAL_QUIT);
}
addErrorToDropBox("anr", app, app.processName, activity, parent, annotation,
cpuInfo, tracesFile, null);
...
}复制代码
触发ANR的方式上面我已经列举过了,这里我以触发Broadcast Timeout时进行分析
AppNotResponding.run
BroadcastQueue.java
private final class AppNotResponding implements Runnable {
private final ProcessRecord mApp;
private final String mAnnotation;
public AppNotResponding(ProcessRecord app, String annotation) {
mApp = app;
mAnnotation = annotation;
}
@Override
public void run() {
// [接上]
mService.appNotResponding(mApp, null, null, false, mAnnotation);
}
}复制代码
broadcastTimeoutLocked
BroadcastQueue.java
final void broadcastTimeoutLocked(boolean fromMsg) {
...
String anrMessage = null;
...
if (app != null) {
anrMessage = "Broadcast of " + r.intent.toString();
}
...
if (anrMessage != null) {
// [接上]
mHandler.post(new AppNotResponding(app, anrMessage));
}
}复制代码
BroadcastHandler.handleMessage
private final class BroadcastHandler extends Handler {
public BroadcastHandler(Looper looper) {
super(looper, null, true);
}
@Override
public void handleMessage(Message msg) {
switch (msg.what) {
...
case BROADCAST_TIMEOUT_MSG: {
synchronized (mService) {
// [接上]
broadcastTimeoutLocked(true);
}
} break;
...
}
}
};复制代码
setBroadcastTimeoutLocked
BroadcastQueue.java
final void setBroadcastTimeoutLocked(long timeoutTime) {
if (! mPendingBroadcastTimeoutMessage) {
Message msg = mHandler.obtainMessage(BROADCAST_TIMEOUT_MSG, this);
mHandler.sendMessageAtTime(msg, timeoutTime);
mPendingBroadcastTimeoutMessage = true;
}
}
final void processNextBroadcast(boolean fromMsg) {
if (! mPendingBroadcastTimeoutMessage) {
long timeoutTime = r.receiverTime + mTimeoutPeriod;
// [接上]
setBroadcastTimeoutLocked(timeoutTime);
}
}复制代码
总结
下面总结一下触发广播ANR后输出traces.txt的过程:
- 有序广播在注册后发送
- AMS设置广播超时时间,通过Handler进行延迟设定
- 广播超时触发超时机制
- Framework层发送信号给虚拟机
- 虚拟机dump出特定进程所有线程信息并存储到traces.txt文件中