This article is based on Android 7.1.
I. Starting the SignalCatcher Thread
1.1 StartSignalCatcher
runtime.cc
void Runtime::InitNonZygoteOrPostFork(
JNIEnv* env, bool is_system_server, NativeBridgeAction action, const char* isa) {
...
StartSignalCatcher();
...
}
As shown above, the SignalCatcher thread is started from Runtime::InitNonZygoteOrPostFork.
runtime.cc
void Runtime::StartSignalCatcher() {
if (!is_zygote_) {
signal_catcher_ = new SignalCatcher(stack_trace_file_);
}
}
If the process is not zygote, a SignalCatcher is created. It follows that the zygote process itself has no SignalCatcher thread, which can be confirmed with adb shell ps -t.
1.2 Creating the SignalCatcher
1.2.1 SignalCatcher(…)
signal_catcher.cc
SignalCatcher::SignalCatcher(const std::string& stack_trace_file)
: stack_trace_file_(stack_trace_file),
lock_("SignalCatcher lock"),
cond_("SignalCatcher::cond_", lock_),
thread_(nullptr) {
SetHaltFlag(false);
// Create a raw pthread; its start routine will attach to the runtime.
CHECK_PTHREAD_CALL(pthread_create, (&pthread_, nullptr, &Run, this), "signal catcher thread");
Thread* self = Thread::Current();
MutexLock mu(self, lock_);
while (thread_ == nullptr) {
cond_.Wait(self);
}
}
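CHECK_PTHREAD_CALL(pthread_create, (&pthread_, nullptr, &Run, this), "signal catcher thread") effectively calls pthread_create(&pthread_, nullptr, &Run, this) and aborts if that call fails. It creates a new thread that runs Run(this), and pthread_ holds the new thread's handle. The constructor then waits on cond_ under lock_ until the new thread sets thread_ (see 1.2.2). By the time the constructor returns, the Signal Catcher thread is attached to the runtime.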
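The handshake between the constructor and Run can be modeled with plain pthreads and std::condition_variable. This is a minimal sketch, not ART code. CatcherLike and started_ are hypothetical names, and a bool flag stands in for thread_ being set after AttachCurrentThread.

```cpp
#include <pthread.h>
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical model of the SignalCatcher constructor handshake: the creating
// thread starts a raw pthread and blocks until the new thread publishes itself.
class CatcherLike {
 public:
  CatcherLike() {
    pthread_create(&pthread_, nullptr, &Run, this);
    std::unique_lock<std::mutex> lk(lock_);
    // Same idea as `while (thread_ == nullptr) cond_.Wait(self);`
    cond_.wait(lk, [this] { return started_; });
  }
  // Join so the members outlive Run (the real catcher is torn down differently).
  ~CatcherLike() { pthread_join(pthread_, nullptr); }

  bool started() {
    std::lock_guard<std::mutex> lk(lock_);
    return started_;
  }

 private:
  static void* Run(void* arg) {
    auto* self = static_cast<CatcherLike*>(arg);
    {
      std::lock_guard<std::mutex> lk(self->lock_);
      self->started_ = true;  // stands in for `signal_catcher->thread_ = self;`
    }
    self->cond_.notify_all();  // stands in for `signal_catcher->cond_.Broadcast(self);`
    return nullptr;            // the real thread would now enter its sigwait loop
  }

  pthread_t pthread_;
  std::mutex lock_;
  std::condition_variable cond_;
  bool started_ = false;
};
```

Waiting in the constructor guarantees that anything using signal_catcher_ afterwards sees a fully attached thread.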
1.2.2 SignalCatcher::Run
signal_catcher.cc
void* SignalCatcher::Run(void* arg) {
SignalCatcher* signal_catcher = reinterpret_cast<SignalCatcher*>(arg);
CHECK(signal_catcher != nullptr);
Runtime* runtime = Runtime::Current();
// Attach the current thread to the current JavaVM
CHECK(runtime->AttachCurrentThread("Signal Catcher", true, runtime->GetSystemThreadGroup(),
!runtime->IsAotCompiler()));
Thread* self = Thread::Current();
DCHECK_NE(self->GetState(), kRunnable);
{
MutexLock mu(self, signal_catcher->lock_);
signal_catcher->thread_ = self;
signal_catcher->cond_.Broadcast(self);
}
// Set up mask with signals we want to handle.
SignalSet signals;
signals.Add(SIGQUIT);
signals.Add(SIGUSR1);
while (true) {
// See 1.2.3
int signal_number = signal_catcher->WaitForSignal(self, signals);
if (signal_catcher->ShouldHalt()) {
runtime->DetachCurrentThread();
return nullptr;
}
switch (signal_number) {
case SIGQUIT:
signal_catcher->HandleSigQuit();
break;
case SIGUSR1:
signal_catcher->HandleSigUsr1();
break;
default:
LOG(ERROR) << "Unexpected signal " << signal_number;
break;
}
}
}
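In short, Run does three things:
- It builds the set of signals it wants to sigwait() for (SIGQUIT and SIGUSR1).
- It calls WaitForSignal to wait for one of them to arrive.
- It dispatches on the signal number, or detaches and exits if ShouldHalt() is set.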
1.2.3 WaitForSignal
signal_catcher.cc
int SignalCatcher::WaitForSignal(Thread* self, SignalSet& signals) {
ScopedThreadStateChange tsc(self, kWaitingInMainSignalCatcherLoop);
// Signals for sigwait() must be blocked but not ignored. We
// block signals like SIGQUIT for all threads, so the condition
// is met. When the signal hits, we wake up, without any signal
// handlers being invoked.
int signal_number = signals.Wait();
if (!ShouldHalt()) {
// Let the user know we got the signal, just in case the system's too screwed for us to
// actually do what they want us to do...
LOG(INFO) << *self << ": reacting to signal " << signal_number;
// If anyone's holding locks (which might prevent us from getting back into state Runnable), say so...
Runtime::Current()->DumpLockHolders(LOG(INFO));
}
return signal_number;
}
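Note the comment above: signals waited on with sigwait() must be blocked, not ignored. The runtime blocks signals such as SIGQUIT for all threads (see the next section), so this condition holds. When a signal arrives, the waiting thread wakes up and no signal handler is invoked.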
1.2.4 BlockSignals
runtime.cc
bool Runtime::Init(RuntimeArgumentMap&& runtime_options_in) {
...
BlockSignals();
...
}
void Runtime::BlockSignals() {
SignalSet signals;
signals.Add(SIGPIPE);
// SIGQUIT is used to dump the runtime's state (including stack traces).
signals.Add(SIGQUIT);
// SIGUSR1 is used to initiate a GC.
signals.Add(SIGUSR1);
signals.Block();
}
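The signals are blocked during VM creation (Runtime::Init). A new thread inherits the signal mask of the thread that creates it. As a result, SIGQUIT and SIGUSR1 stay blocked in the runtime's threads and can only be consumed by the Signal Catcher's sigwait().
II. HandleSigQuit
When the process receives SIGQUIT (signal 3), signal_catcher->HandleSigQuit() is called to dump runtime information and stack traces.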
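The sketch below shows why "blocked but not ignored" is enough for sigwait(). It is a minimal POSIX example, not ART's SignalSet class, and BlockRaiseAndWait and g_handler_ran are hypothetical names. It blocks a signal, raises it, and shows that sigwait() consumes the pending signal without calling the installed handler. It uses sigprocmask, which assumes a single-threaded caller. A multi-threaded program would use pthread_sigmask.

```cpp
#include <signal.h>
#include <cassert>
#include <csignal>

// Set by the handler; stays 0 if sigwait() consumes the signal instead.
static volatile sig_atomic_t g_handler_ran = 0;
static void OnSignal(int) { g_handler_ran = 1; }

// Sketch of the BlockSignals() + WaitForSignal() pattern.
// Returns the signal number taken by sigwait(), or -1 on error.
int BlockRaiseAndWait(int signo) {
  // Install a handler only to prove that it is NOT invoked.
  struct sigaction sa {};
  sa.sa_handler = OnSignal;
  sigemptyset(&sa.sa_mask);
  if (sigaction(signo, &sa, nullptr) != 0) return -1;

  sigset_t set;
  sigemptyset(&set);
  sigaddset(&set, signo);
  // Like Runtime::BlockSignals(): blocked, not ignored (SIG_IGN would discard it).
  if (sigprocmask(SIG_BLOCK, &set, nullptr) != 0) return -1;

  raise(signo);  // stays pending because it is blocked

  int received = 0;
  if (sigwait(&set, &received) != 0) return -1;  // plays the role of SignalSet::Wait()
  return received;
}
```

If the signal were set to SIG_IGN instead of blocked, the kernel would discard it and sigwait() would never see it.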
来 dump 一些信息和 stack traces
2.1 SignalCatcher::HandleSigQuit
signal_catcher.cc
void SignalCatcher::HandleSigQuit() {
Runtime* runtime = Runtime::Current();
std::ostringstream os;
// ----- pid 2830 at 2017-11-16 11:22:53 -----
os << "\n"
<< "----- pid " << getpid() << " at " << GetIsoDate() << " -----\n";
// Cmd line: system_server
DumpCmdLine(os);
std::string fingerprint = runtime->GetFingerprint();
// Build fingerprint: 'Xiaomi/cancro_wc_lte/cancro:6.0.1/MMB29M/1.1.1:user/test-keys'
os << "Build fingerprint: '" << (fingerprint.empty() ? "unknown" : fingerprint) << "'\n";
// ABI: 'arm'
os << "ABI: '" << GetInstructionSetString(runtime->GetInstructionSet()) << "'\n";
// Build type: optimized
os << "Build type: " << (kIsDebugBuild ? "debug" : "optimized") << "\n";
runtime->DumpForSigQuit(os);
if ((false)) {
std::string maps;
if (ReadFileToString("/proc/self/maps", &maps)) {
os << "/proc/self/maps:\n" << maps;
}
}
// ----- end 2830 -----
os << "----- end " << getpid() << " -----\n";
Output(os.str());
}
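The framing that HandleSigQuit writes around every dump can be reproduced in a few lines. This is a sketch. IsoDateNow approximates ART's GetIsoDate() using strftime and local time. FrameSigQuitDump is a hypothetical helper, and its body argument stands in for the output of DumpCmdLine() and DumpForSigQuit().

```cpp
#include <unistd.h>
#include <cassert>
#include <ctime>
#include <sstream>
#include <string>

// Approximation of GetIsoDate(): local time as "YYYY-MM-DD HH:MM:SS".
std::string IsoDateNow() {
  std::time_t now = std::time(nullptr);
  std::tm tm_buf{};
  localtime_r(&now, &tm_buf);
  char buf[32];
  std::strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S", &tm_buf);
  return buf;
}

// Builds the "----- pid ... -----" / "----- end ... -----" frame of a SIGQUIT dump.
std::string FrameSigQuitDump(pid_t pid, const std::string& date, const std::string& body) {
  std::ostringstream os;
  os << "\n----- pid " << pid << " at " << date << " -----\n";
  os << body;
  os << "----- end " << pid << " -----\n";
  return os.str();
}
```

A caller would typically pass getpid() and IsoDateNow(), just as HandleSigQuit uses getpid() and GetIsoDate().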
2.2 DumpForSigQuit
runtime.cc
void Runtime::DumpForSigQuit(std::ostream& os) {
// Zygote loaded classes=4188 post zygote classes=3570
GetClassLinker()->DumpForSigQuit(os);
// Intern table: 59686 strong; 10043 weak
GetInternTable()->DumpForSigQuit(os);
// JNI: CheckJNI is off; globals=1993 (plus 2995 weak)
// Libraries: /system/lib/hw/gralloc.msm8974.so ...
GetJavaVM()->DumpForSigQuit(os);
oat_file_manager_->DumpForSigQuit(os);
if (GetJit() != nullptr) {
GetJit()->DumpForSigQuit(os);
} else {
os << "Running non JIT\n";
}
TrackedAllocators::Dump(os);
os << "\n";
thread_list_->DumpForSigQuit(os);
BaseMutex::DumpAll(os);
}
thread_list_->DumpForSigQuit(os) is the key step, because it dumps the threads' stack traces.
2.3 thread_list_->DumpForSigQuit
2.3.1 ThreadList::DumpForSigQuit
thread_list.cc
void ThreadList::DumpForSigQuit(std::ostream& os) {
{
ScopedObjectAccess soa(Thread::Current());
// Only print if we have samples.
if (suspend_all_historam_.SampleSize() > 0) {
Histogram<uint64_t>::CumulativeData data;
suspend_all_historam_.CreateHistogram(&data);
suspend_all_historam_.PrintConfidenceIntervals(os, 0.99, data); // Dump time to suspend.
}
}
bool dump_native_stack = Runtime::Current()->GetDumpNativeStackOnSigQuit();
Dump(os, dump_native_stack);
// Dump stack traces of threads in this process that are not attached to the runtime
DumpUnattachedThreads(os, dump_native_stack);
}
2.3.2 ThreadList::Dump
thread_list.cc
void ThreadList::Dump(std::ostream& os, bool dump_native_stack) {
{
MutexLock mu(Thread::Current(), *Locks::thread_list_lock_);
os << "DALVIK THREADS (" << list_.size() << "):\n";
}
DumpCheckpoint checkpoint(&os, dump_native_stack);
size_t threads_running_checkpoint;
{
// Use SOA to prevent deadlocks if multiple threads are calling Dump() at the same time.
ScopedObjectAccess soa(Thread::Current());
threads_running_checkpoint = RunCheckpoint(&checkpoint);
}
if (threads_running_checkpoint != 0) {
checkpoint.WaitForThreadsToRunThroughCheckpoint(threads_running_checkpoint);
}
}
Dump creates a DumpCheckpoint object named checkpoint and then calls RunCheckpoint(&checkpoint). First, let's look at what DumpCheckpoint is.
2.3.3 DumpCheckpoint
thread_list.cc
class DumpCheckpoint FINAL : public Closure {
public:
DumpCheckpoint(std::ostream* os, bool dump_native_stack)
: os_(os),
barrier_(0),
backtrace_map_(dump_native_stack ? BacktraceMap::Create(getpid()) : nullptr),
dump_native_stack_(dump_native_stack) {}
void Run(Thread* thread) OVERRIDE {
Thread* self = Thread::Current();
std::ostringstream local_os;
{
ScopedObjectAccess soa(self);
// 1. Dump the thread's traces and related state
if (!timeout_threads_.empty()
&& find(timeout_threads_.begin(), timeout_threads_.end(), thread) != timeout_threads_.end()) {
Thread::DumpState(local_os, thread, thread->GetTid(), true);
} else {
thread->Dump(local_os, dump_native_stack_, backtrace_map_.get());
}
}
local_os << "\n";
{
// Use the logging lock to ensure serialization when writing to the common ostream.
MutexLock mu(self, *Locks::logging_lock_);
*os_ << local_os.str();
}
// 2. After a thread finishes its dump in Run, it notifies barrier_ to decrement count_ by 1 and removes the finished thread from thread_list_; when count_ reaches 0, all threads have finished dumping
barrier_.Pass(self, thread);
}
private:
// The common stream that will accumulate all the dumps.
std::ostream* const os_;
// The barrier to be passed through and for the requestor to wait upon.
Barrier barrier_;
// A backtrace map, so that all threads use a shared info and don't reacquire/parse separately.
std::unique_ptr<BacktraceMap> backtrace_map_;
// Whether we should dump the native stack.
const bool dump_native_stack_;
std::list<Thread*> timeout_threads_;
};
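We can see that:
- Constructing a DumpCheckpoint only initializes a few member variables.
- DumpCheckpoint::Run does two things:
  - It is where the per-thread dump actually happens.
  - After a thread finishes its dump, it calls barrier_.Pass(). This decrements barrier_'s count_ by 1 and removes the finished thread from thread_list_. Together with the Increment() in WaitForThreadsToRunThroughCheckpoint (2.3.5), count_ returning to 0 means all threads have finished dumping.
Next, let's look at what RunCheckpoint does.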
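The locking pattern in DumpCheckpoint::Run has three steps: format privately, append under one lock, then pass the barrier. It is sketched below with std::thread and std::mutex. DumpCollector, DumpAllConcurrently and the record format are hypothetical, and a plain counter replaces barrier_.Pass().

```cpp
#include <cassert>
#include <mutex>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for DumpCheckpoint's output handling.
class DumpCollector {
 public:
  void Run(int tid) {
    std::ostringstream local_os;  // formatting needs no lock
    local_os << "\"thread-" << tid << "\" tid=" << tid << "\n";
    std::lock_guard<std::mutex> lk(logging_lock_);  // like Locks::logging_lock_
    out_ << local_os.str();  // whole record appended at once, never interleaved
    ++passed_;               // stands in for barrier_.Pass(self, thread)
  }
  std::string str() {
    std::lock_guard<std::mutex> lk(logging_lock_);
    return out_.str();
  }
  int passed() {
    std::lock_guard<std::mutex> lk(logging_lock_);
    return passed_;
  }

 private:
  std::mutex logging_lock_;
  std::ostringstream out_;
  int passed_ = 0;
};

// Runs one dump per simulated thread concurrently, then joins them all.
void DumpAllConcurrently(DumpCollector& collector, int n) {
  std::vector<std::thread> workers;
  for (int i = 0; i < n; ++i) workers.emplace_back([&collector, i] { collector.Run(i); });
  for (auto& w : workers) w.join();
}
```

Building the text in local_os first keeps the critical section short. It also guarantees that each thread's trace appears as one contiguous block in the shared output.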
2.3.4 ThreadList::RunCheckpoint
thread_list.cc
size_t ThreadList::RunCheckpoint(Closure* checkpoint_function, bool isDumpCheckpoint) { // isDumpCheckpoint defaults to false
Thread* self = Thread::Current();
Locks::mutator_lock_->AssertNotExclusiveHeld(self);
Locks::thread_list_lock_->AssertNotHeld(self);
Locks::thread_suspend_count_lock_->AssertNotHeld(self);
std::vector<Thread*> suspended_count_modified_threads;
size_t count = 0;
{
// Call a checkpoint function for each thread, threads which are suspend get their checkpoint
// manually called.
MutexLock mu(self, *Locks::thread_list_lock_);
MutexLock mu2(self, *Locks::thread_suspend_count_lock_);
if(isDumpCheckpoint) {
((DumpCheckpoint *)checkpoint_function)->SetThreadList(self, list_);
}
// 1. Handle each thread in list_ according to its state
count = list_.size();
for (const auto& thread : list_) {
if (thread != self) {
while (true) {
if (thread->RequestCheckpoint(checkpoint_function)) {
// This thread will run its checkpoint some time in the near future.
break;
} else {
// For a suspended thread: raise its suspend count, then add it to suspended_count_modified_threads, which is processed further below
if (thread->GetState() == kRunnable) {
// Spurious fail, try again.
continue;
}
thread->ModifySuspendCount(self, +1, nullptr, false);
suspended_count_modified_threads.push_back(thread);
break;
}
}
}
}
}
// Run the checkpoint on ourself while we wait for threads to suspend.
// 2. For the Signal Catcher thread itself, call the checkpoint function's Run here to dump this thread
checkpoint_function->Run(self);
// Run the checkpoint on the suspended threads.
for (const auto& thread : suspended_count_modified_threads) {
if (!thread->IsSuspended()) {
if (ATRACE_ENABLED()) {
std::ostringstream oss;
thread->ShortDump(oss);
ATRACE_BEGIN((std::string("Waiting for suspension of thread ") + oss.str()).c_str());
}
// Busy wait until the thread is suspended.
const uint64_t start_time = NanoTime();
do {
ThreadSuspendSleep(kThreadSuspendInitialSleepUs);
} while (!thread->IsSuspended());
const uint64_t total_delay = NanoTime() - start_time;
// Shouldn't need to wait for longer than 1000 microseconds.
constexpr uint64_t kLongWaitThreshold = MsToNs(1);
ATRACE_END();
// AOSP then logs a warning if total_delay > kLongWaitThreshold (omitted in this excerpt).
}
// We know for sure that the thread is suspended at this point.
// 3. For each suspended thread, run checkpoint_function->Run on its behalf
checkpoint_function->Run(thread);
{
MutexLock mu2(self, *Locks::thread_suspend_count_lock_);
// 4. The thread has been dumped; lower its suspend count by 1
thread->ModifySuspendCount(self, -1, nullptr, false);
}
}
{
// 5. Imitate ResumeAll, threads may be waiting on Thread::resume_cond_ since we raised their
// suspend count. Now the suspend_count_ is lowered so we must do the broadcast.
MutexLock mu2(self, *Locks::thread_suspend_count_lock_);
Thread::resume_cond_->Broadcast(self);
}
// Returns the size of thread_list (list_.size())
return count;
}
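The split that RunCheckpoint makes between runnable and suspended threads can be shown with a single-threaded model. FakeThread and this RunCheckpoint signature are hypothetical simplifications. The model has no real suspend counts, locks or busy-waiting, and a vector of pending functions stands in for checkpoint_functions.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using CheckpointFn = std::function<void(struct FakeThread*)>;

// Hypothetical thread model: not an ART type.
struct FakeThread {
  int tid;
  bool runnable;
  std::vector<CheckpointFn> pending;  // like tlsPtr_.checkpoint_functions

  // Mirrors RequestCheckpoint(): only a runnable thread accepts a checkpoint.
  bool RequestCheckpoint(const CheckpointFn& fn) {
    if (!runnable) return false;
    pending.push_back(fn);
    return true;
  }
  // What a runnable thread does when it reaches a suspend point.
  void RunCheckpointFunctions() {
    for (auto& fn : pending) fn(this);
    pending.clear();
  }
};

// Returns list.size() (the caller is counted too), like the real RunCheckpoint.
std::size_t RunCheckpoint(std::vector<FakeThread*>& list, FakeThread* self,
                          const CheckpointFn& fn) {
  std::vector<FakeThread*> suspended;
  for (FakeThread* t : list) {
    if (t == self) continue;
    // The real code also raises the suspend count of a thread that refuses.
    if (!t->RequestCheckpoint(fn)) suspended.push_back(t);
  }
  fn(self);                               // the requester dumps itself first
  for (FakeThread* t : suspended) fn(t);  // then dumps every suspended thread
  return list.size();
}
```

In the test, the caller dumps the suspended thread immediately. The runnable thread's dump only happens once that thread reaches its own suspend point, which is why the caller must wait on the barrier afterwards.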
2.3.5 WaitForThreadsToRunThroughCheckpoint
thread_list.cc
class DumpCheckpoint FINAL : public Closure {
public:
void WaitForThreadsToRunThroughCheckpoint(size_t threads_running_checkpoint) {
Thread* self = Thread::Current();
ThreadState new_state = kWaitingForCheckPointsToRun;
if(Locks::abort_lock_->IsExclusiveHeld(self) && self->GetState() == kRunnable) {
new_state = kRunnable;
}
ScopedThreadStateChange tsc(self, new_state);
bool timed_out = barrier_.Increment(self, threads_running_checkpoint, kDumpWaitTimeout);
if (timed_out) {
// Avoid a recursive abort.
LOG(ERROR) << "Unexpected time out during dump checkpoint.";
std::list<Thread*> list = barrier_.GetThreadList(self);
timeout_threads_.assign(list.begin(), list.end());
{
// abnormal dump
MutexLock mu(self, *Locks::logging_lock_);
*os_ << " ------- " << timeout_threads_.size() << " threads dump checkpoint timed out --------\n\n";
}
for (const auto& thread : timeout_threads_) {
bool contains = false;
{
MutexLock mu(self, *Locks::thread_list_lock_);
std::list<Thread*> thread_list = Runtime::Current()->GetThreadList()->GetList();
contains = find(thread_list.begin(), thread_list.end(), thread) != thread_list.end();
}
// 1. detached thread should have already passed the barrier
// 2. only kRunnable thread have been set a checkpoint function
// 3. non kRunnable thread is dumped by this thread, will not timeout
if (contains && thread->HasCheckpointFunction(this)) {
thread->RunCheckpointFunction();
}
}
}
}
};
barrier.cc
bool Barrier::Increment(Thread* self, int delta, uint32_t timeout_ms) {
MutexLock mu(self, lock_);
SetCountLocked(self, count_ + delta);
bool timed_out = false;
if (count_ != 0) {
uint32_t timeout_ns = 0;
uint64_t abs_timeout = NanoTime() + MsToNs(timeout_ms);
for (;;) {
timed_out = condition_.TimedWait(self, timeout_ms, timeout_ns);
if (timed_out || count_ == 0) return timed_out;
// Compute time remaining on timeout.
uint64_t now = NanoTime();
int64_t time_left = abs_timeout - now;
if (time_left <= 0) return true;
timeout_ns = time_left % (1000*1000);
timeout_ms = time_left / (1000*1000);
}
}
return timed_out;
}
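Increment returns in two cases: on timeout, or when count_ == 0 (that is, all threads have finished dumping). So WaitForThreadsToRunThroughCheckpoint waits for every thread to finish its dump, and it handles threads that miss the timeout specially. First, it writes a "threads dump checkpoint timed out" line. Then it calls thread->RunCheckpointFunction() directly for each timed-out thread that is still in the thread list and still has this checkpoint installed. Run then dumps such a thread with Thread::DumpState (the timeout_threads_ branch) instead of a full Thread::Dump.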
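A barrier with the same Pass()/Increment() contract can be sketched with std::condition_variable. CountBarrier is a hypothetical name, and this is not ART's Barrier class. As in DumpCheckpoint, Pass() may run before Increment(), so count_ can temporarily go negative.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Sketch of Barrier::Pass()/Barrier::Increment() semantics with std:: types.
class CountBarrier {
 public:
  // Called by each thread after it finishes its dump.
  void Pass() {
    std::lock_guard<std::mutex> lk(lock_);
    --count_;
    if (count_ == 0) cv_.notify_all();
  }
  // Adds the number of expected passes and waits for count_ to reach 0.
  // Returns true on timeout, false once every expected Pass() has happened.
  bool Increment(int delta, std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lk(lock_);
    count_ += delta;
    // wait_until keeps a fixed deadline across spurious wakeups, which is what
    // the manual time_left recomputation in Barrier::Increment achieves.
    auto deadline = std::chrono::steady_clock::now() + timeout;
    return !cv_.wait_until(lk, deadline, [this] { return count_ == 0; });
  }

 private:
  std::mutex lock_;
  std::condition_variable cv_;
  int count_ = 0;
};
```

Because Increment is given the total thread count, a thread that already passed before the wait began is accounted for. A missing thread shows up as a timeout.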
2.3.6 RequestCheckpoint
thread.cc
bool Thread::RequestCheckpoint(Closure* function) {
union StateAndFlags old_state_and_flags;
old_state_and_flags.as_int = tls32_.state_and_flags.as_int;
if (old_state_and_flags.as_struct.state != kRunnable) {
return false; // 1. Fail, thread is suspended and so can't run a checkpoint.
}
uint32_t available_checkpoint = kMaxCheckpoints;
for (uint32_t i = 0 ; i < kMaxCheckpoints; ++i) {
if (tlsPtr_.checkpoint_functions[i] == nullptr) {
available_checkpoint = i;
break;
}
}
if (available_checkpoint == kMaxCheckpoints) {
// 2. No checkpoint functions available, we can't run a checkpoint
return false;
}
// 3. Install checkpoint_function
tlsPtr_.checkpoint_functions[available_checkpoint] = function;
// Checkpoint function installed now install flag bit.
// We must be runnable to request a checkpoint.
DCHECK_EQ(old_state_and_flags.as_struct.state, kRunnable);
union StateAndFlags new_state_and_flags;
new_state_and_flags.as_int = old_state_and_flags.as_int;
new_state_and_flags.as_struct.flags |= kCheckpointRequest;
bool success = tls32_.state_and_flags.as_atomic_int.CompareExchangeStrongSequentiallyConsistent(
old_state_and_flags.as_int, new_state_and_flags.as_int);
if (UNLIKELY(!success)) {
// The thread changed state before the checkpoint was installed.
CHECK_EQ(tlsPtr_.checkpoint_functions[available_checkpoint], function);
tlsPtr_.checkpoint_functions[available_checkpoint] = nullptr;
} else {
CHECK_EQ(ReadFlag(kCheckpointRequest), true);
TriggerSuspend();
}
return success;
}
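This method only applies to kRunnable threads. It installs checkpoint_function in a free slot and sets the kCheckpointRequest flag. When the thread reaches a checkpoint, it runs the functions in checkpoint_functions, which in our case means DumpCheckpoint::Run. The method returns false in three cases: the thread is not runnable, no slot is free, or the thread's state changes before the compare-and-swap succeeds.
2.4 Summary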
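The install-then-CAS sequence in RequestCheckpoint can be sketched with std::atomic. Several details are assumptions made for illustration and do not necessarily match ART's StateAndFlags union or its kCheckpointRequest value: the bit layout (state in the high 16 bits, flags in the low 16), the flag value, and the three-slot array.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical encoding: state in the high 16 bits, flags in the low 16 bits.
constexpr uint32_t kStateRunnable = 1;
constexpr uint32_t kStateSuspended = 2;
constexpr uint32_t kCheckpointRequestFlag = 1u << 1;
constexpr int kMaxCheckpointSlots = 3;

using CheckpointFn = void (*)(int);

struct ThreadModel {
  std::atomic<uint32_t> state_and_flags{kStateRunnable << 16};
  CheckpointFn slots[kMaxCheckpointSlots] = {nullptr, nullptr, nullptr};

  static uint32_t State(uint32_t v) { return v >> 16; }

  bool RequestCheckpoint(CheckpointFn fn) {
    uint32_t old_val = state_and_flags.load();
    if (State(old_val) != kStateRunnable) return false;  // caller must run fn itself
    int slot = -1;
    for (int i = 0; i < kMaxCheckpointSlots; ++i) {
      if (slots[i] == nullptr) { slot = i; break; }
    }
    if (slot < 0) return false;  // no free checkpoint slot
    slots[slot] = fn;            // install the function first...
    uint32_t new_val = old_val | kCheckpointRequestFlag;
    // ...then publish the flag; fails if state or flags changed since the load.
    if (!state_and_flags.compare_exchange_strong(old_val, new_val)) {
      slots[slot] = nullptr;  // state changed under us: undo the install
      return false;
    }
    return true;
  }
};
```

Setting the flag with a single CAS on the combined word is what makes "still runnable" and "checkpoint requested" one atomic decision.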
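From the analysis above:
- In DumpForSigQuit, RunCheckpoint does most of the work. It splits threads into two cases, suspended and kRunnable, and dumps each case differently:
  - A suspended thread has its suspend count raised and is stored in suspended_count_modified_threads. Afterwards, DumpCheckpoint::Run (the dump) is executed for each thread in that list. In this case, the dump of every suspended thread runs on the "Signal Catcher" thread.
  - A kRunnable thread gets a RequestCheckpoint call, which installs checkpoint_function. When the thread reaches a checkpoint, it runs the functions in checkpoint_functions. In this case, the dump runs on each thread itself.
- After RunCheckpoint returns a non-zero count, Dump calls checkpoint.WaitForThreadsToRunThroughCheckpoint(threads_running_checkpoint):
  - threads_running_checkpoint is the size of the whole thread_list, i.e. the number of threads to dump (including the Signal Catcher itself).
  - After each thread finishes its dump in Run, it notifies barrier_ to decrement count_ by 1 and removes itself from thread_list_. Once the expected count has been added, count_ reaching 0 means all threads have finished. That is what this call waits for. If it does not happen before the timeout, the timed-out threads are handled specially.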