Android ANR Trace Explained

Copyright notice: this is the blogger's original article and may not be reproduced without the blogger's permission. https://blog.csdn.net/hl09083253cy/article/details/78418742

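This article summarizes how the Signal Catcher thread dumps information after it receives SIGQUIT (signal 3).

The main goal is to explain what each piece of data in an ANR trace means, so that the trace becomes easier to read.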

Android SourceCode: 6.0

Keywords: block signal, kRunnable, kSuspended, Checkpoint, traces.txt


1.Blocking signals in ART

Signals are blocked when the virtual machine starts up.

bool Runtime::Init(RuntimeArgumentMap&& runtime_options_in) {
  ...
  BlockSignals();
  ...
}
void Runtime::BlockSignals() {
  SignalSet signals;
  signals.Add(SIGPIPE);
  // SIGQUIT is used to dump the runtime's state (including stack traces).
  signals.Add(SIGQUIT);
  // SIGUSR1 is used to initiate a GC.
  signals.Add(SIGUSR1);
  signals.Block();
}

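The point to note here: the signals are blocked in the main thread of the process, and every thread created afterwards inherits that mask, so those threads block the same signals.

In effect, SIGPIPE, SIGQUIT, and SIGUSR1 are all blocked here.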
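The inheritance rule can be tried outside ART with plain POSIX calls. The following is a minimal sketch, not ART code: the function name BlockThenCatchInChild is made up for illustration, and SIGUSR1 stands in for SIGQUIT. It blocks the signal before creating a thread, checks that the child thread sees the same mask, and lets the child collect the signal with sigwait.

```cpp
#include <signal.h>
#include <unistd.h>

#include <thread>

// Hypothetical helper (not part of ART). Blocks SIGUSR1 in the calling thread,
// starts a child thread, and lets the child pick the signal up with sigwait().
// Returns the signal number the child received, or -1 if the child did not
// inherit the blocked mask.
int BlockThenCatchInChild() {
  sigset_t set;
  sigemptyset(&set);
  sigaddset(&set, SIGUSR1);
  // Block before creating any thread, in the spirit of Runtime::BlockSignals().
  pthread_sigmask(SIG_BLOCK, &set, nullptr);

  bool inherited = false;
  int received = 0;
  std::thread catcher([&] {
    sigset_t current;
    sigemptyset(&current);
    // With a null "set" argument, pthread_sigmask only reports the current mask.
    pthread_sigmask(SIG_BLOCK, nullptr, &current);
    inherited = (sigismember(&current, SIGUSR1) == 1);
    sigwait(&set, &received);  // Sleeps until SIGUSR1 is pending.
  });

  // Process-directed signal, similar to "kill -3 <pid>" for SIGQUIT.
  kill(getpid(), SIGUSR1);
  catcher.join();
  return inherited ? received : -1;
}
```

Because every thread blocks the signal, it stays pending until the one thread that calls sigwait picks it up; that is the property the dedicated signal catcher thread relies on.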

2.Starting the SignalCatcher thread

void Runtime::DidForkFromZygote(JNIEnv* env, NativeBridgeAction action, const char* isa) {
  ...
  StartSignalCatcher();
  ...
}

As shown here, the signal catcher thread is started from DidForkFromZygote(),

which means the zygote process itself has no signal catcher thread; you can confirm this with ps -t.

void Runtime::StartSignalCatcher() {
  if (!is_zygote_) {
    signal_catcher_ = new SignalCatcher(stack_trace_file_);
  }
}
SignalCatcher::SignalCatcher(const std::string& stack_trace_file)
    : stack_trace_file_(stack_trace_file),
      lock_("SignalCatcher lock"),
      cond_("SignalCatcher::cond_", lock_),
      thread_(nullptr) {
  SetHaltFlag(false);
  // Create a raw pthread; its start routine will attach to the runtime.
  CHECK_PTHREAD_CALL(pthread_create, (&pthread_, nullptr, &Run, this), "signal catcher thread");
  Thread* self = Thread::Current();
  MutexLock mu(self, lock_);
  while (thread_ == nullptr) {
    cond_.Wait(self);
  }
}

The new thread executes the Run method in signal_catcher.cc:

void* SignalCatcher::Run(void* arg) {
  SignalCatcher* signal_catcher = reinterpret_cast<SignalCatcher*>(arg);
  CHECK(signal_catcher != nullptr);
  Runtime* runtime = Runtime::Current();
  CHECK(runtime->AttachCurrentThread("Signal Catcher", true, runtime->GetSystemThreadGroup(),
                                     !runtime->IsAotCompiler()));
  Thread* self = Thread::Current();
  DCHECK_NE(self->GetState(), kRunnable);
  {
    MutexLock mu(self, signal_catcher->lock_);
    signal_catcher->thread_ = self;
    signal_catcher->cond_.Broadcast(self);
  }
  // Set up mask with signals we want to handle.
  SignalSet signals;
  signals.Add(SIGQUIT);
  signals.Add(SIGUSR1);
  while (true) {
    int signal_number = signal_catcher->WaitForSignal(self, signals);
    if (signal_catcher->ShouldHalt()) {
      runtime->DetachCurrentThread();
      return nullptr;
    }
    switch (signal_number) {
    case SIGQUIT:
      signal_catcher->HandleSigQuit();
      break;
    case SIGUSR1:
      signal_catcher->HandleSigUsr1();
      break;
    default:
      LOG(ERROR) << "Unexpected signal %d" << signal_number;
      break;
    }
  }
}

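Note that Run calls signal_catcher->cond_.Broadcast(self) only after the thread has attached.

This guarantees that once the SignalCatcher constructor returns, the signal catcher thread is already running and attached to the current VM.

In addition, SignalCatcher::WaitForSignal calls sigwait to wait for SIGQUIT or SIGUSR1,

so the signal catcher thread spends most of its time sleeping and continues only when one of these two signals arrives.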
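The constructor/Run handshake described above can be reproduced with the standard library. This is a minimal sketch under stated assumptions: ToyCatcher and its members are hypothetical stand-ins, and std::mutex with std::condition_variable replaces ART's Mutex and ConditionVariable.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for SignalCatcher: the constructor returns only after
// the worker thread has announced that it is up.
class ToyCatcher {
 public:
  ToyCatcher() : worker_([this] { Run(); }) {
    std::unique_lock<std::mutex> lock(mu_);
    // Same shape as "while (thread_ == nullptr) cond_.Wait(self);".
    cond_.wait(lock, [this] { return attached_; });
  }

  ~ToyCatcher() { worker_.join(); }

  bool attached() {
    std::lock_guard<std::mutex> lock(mu_);
    return attached_;
  }

 private:
  void Run() {
    // The real thread would attach to the runtime before this point.
    std::lock_guard<std::mutex> lock(mu_);
    attached_ = true;
    cond_.notify_all();
  }

  // Declared before worker_ so they are constructed before the thread starts.
  std::mutex mu_;
  std::condition_variable cond_;
  bool attached_ = false;
  std::thread worker_;
};
```

Constructing a ToyCatcher and then calling attached() always yields true, because the constructor cannot return before the worker has set the flag.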

 

3.HandleSigQuit

When SIGQUIT (signal 3) arrives, the signal catcher calls HandleSigQuit to dump information:

void SignalCatcher::HandleSigQuit() {
  Runtime* runtime = Runtime::Current();
  std::ostringstream os;
  os << "\n"
      << "----- pid " << getpid() << " at " << GetIsoDate() << " -----\n";
  DumpCmdLine(os);
  // Note: The strings "Build fingerprint:" and "ABI:" are chosen to match the format used by
  // debuggerd. This allows, for example, the stack tool to work.
  std::string fingerprint = runtime->GetFingerprint();
  os << "Build fingerprint: '" << (fingerprint.empty() ? "unknown" : fingerprint) << "'\n";
  os << "ABI: '" << GetInstructionSetString(runtime->GetInstructionSet()) << "'\n";
  os << "Build type: " << (kIsDebugBuild ? "debug" : "optimized") << "\n";
  // Before Android 5.0, all threads were suspended with SuspendAll before the dump and resumed with ResumeAll afterwards;
  // later versions dump threads through checkpoints instead.
  runtime->DumpForSigQuit(os); 
  if ((false)) {
    std::string maps;
    if (ReadFileToString("/proc/self/maps", &maps)) {
      os << "/proc/self/maps:\n" << maps;
    }
  }
  os << "----- end " << getpid() << " -----\n";
  Output(os.str());
}

First it dumps the pid, time, process name, fingerprint, ABI, and so on. For example, the beginning of an ANR traces.txt:

----- pid 9723 at 2017-04-11 17:12:10 -----
Cmd line: com.android.mms
Build fingerprint: '××××07:userdebug/test-keys'
ABI: 'arm64'
Build type: optimized

Next, runtime->DumpForSigQuit(os) is called to dump the detailed state of the process.

Then the end marker of the current trace is written:

----- end 10814 -----

Finally, Output(os.str()) writes everything to the ANR trace file, /data/anr/traces.txt.
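Since every process dump is framed by a "----- pid N at ... -----" line and a "----- end N -----" line, a tool can split traces.txt on those markers. The sketch below illustrates the framing only; FrameTrace and ParseHeaderPid are hypothetical helper names, not ART functions.

```cpp
#include <cstdio>
#include <sstream>
#include <string>

// Hypothetical helper: wraps a dump body in the begin/end markers shown above.
std::string FrameTrace(int pid, const std::string& iso_date,
                       const std::string& cmd_line, const std::string& body) {
  std::ostringstream os;
  os << "\n----- pid " << pid << " at " << iso_date << " -----\n";
  os << "Cmd line: " << cmd_line << "\n";
  os << body;
  os << "----- end " << pid << " -----\n";
  return os.str();
}

// Hypothetical helper: extracts the pid from a "----- pid N at ... -----"
// header line. Returns -1 if the line is not a header.
int ParseHeaderPid(const std::string& line) {
  int pid = -1;
  if (std::sscanf(line.c_str(), "----- pid %d at ", &pid) != 1) return -1;
  return pid;
}
```

A reader of traces.txt could call ParseHeaderPid on each line and start a new per-process section whenever it returns a non-negative value.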

 

4.Runtime DumpForSigQuit

void Runtime::DumpForSigQuit(std::ostream& os) {
  GetClassLinker()->DumpForSigQuit(os);
  GetInternTable()->DumpForSigQuit(os);
  GetJavaVM()->DumpForSigQuit(os);
  GetHeap()->DumpForSigQuit(os);
  TrackedAllocators::Dump(os);
  os << "\n";
  thread_list_->DumpForSigQuit(os);
  BaseMutex::DumpAll(os);
}

As you can see, handling SIGQUIT prints quite a lot of information:

  GetClassLinker()->DumpForSigQuit(os); dumps the number of classes loaded in the current process, for example:
Zygote loaded classes=4530 post zygote classes=849
  GetInternTable()->DumpForSigQuit(os); dumps the string intern table of the current process:
Intern table: 41808 strong; 360 weak
  GetJavaVM()->DumpForSigQuit(os); dumps VM information: the number of global references, the number of weak global references, and the loaded shared libraries:
JNI: CheckJNI is off; globals=719 (plus 410 weak) // the process aborts if either value exceeds 51200
Libraries: /system/lib64/libandroid.so /system/lib64/libcompiler_rt.so /system/lib64/libdrmframework_jni.so
  GetHeap()->DumpForSigQuit(os); prints a lot of data, mainly the current heap state, GC performance statistics, dex information, and JIT/profile information.

There is too much to paste here; open any traces.txt to see it.

  TrackedAllocators::Dump(os); dumps native memory usage when the kEnableTrackingAllocator switch is on; it is off by default.

  thread_list_->DumpForSigQuit(os); this is the key dump: the call stack of every thread. For example, the dump of the Signal Catcher thread:

DALVIK THREADS (27):
"Signal Catcher" daemon prio=5 tid=3 Runnable
  | group="system" sCount=0 dsCount=0 obj=0x32c050d0 self=0x7f97dd1400
  | sysTid=9729 nice=0 cgrp=default sched=0/0 handle=0x7fa200e450
  | state=R schedstat=( 217991249 1074429 82 ) utm=15 stm=6 core=4 HZ=100
  | stack=0x7fa1f14000-0x7fa1f16000 stackSize=1005KB
  | held mutexes= "mutator lock"(shared held)
  native: #00 pc 000000000047661c  /system/lib64/libart.so (_ZN3art15DumpNativeStackERNSt3__113basic_ostreamIcNS0_11char_traitsIcEEEEiP12BacktraceMapPKcPNS_9ArtMethodEPv+220)
  native: #01 pc 0000000000476618  /system/lib64/libart.so (_ZN3art15DumpNativeStackERNSt3__113basic_ostreamIcNS0_11char_traitsIcEEEEiP12BacktraceMapPKcPNS_9ArtMethodEPv+216)
  native: #02 pc 000000000044ae64  /system/lib64/libart.so (_ZNK3art6Thread9DumpStackERNSt3__113basic_ostreamIcNS1_11char_traitsIcEEEEbP12BacktraceMap+472)
  native: #03 pc 0000000000462584  /system/lib64/libart.so (_ZN3art14DumpCheckpoint3RunEPNS_6ThreadE+820)
  native: #04 pc 000000000045a864  /system/lib64/libart.so (_ZN3art10ThreadList13RunCheckpointEPNS_7ClosureE+456)
  native: #05 pc 000000000045a474  /system/lib64/libart.so (_ZN3art10ThreadList4DumpERNSt3__113basic_ostreamIcNS1_11char_traitsIcEEEEb+288)
  native: #06 pc 000000000045a310  /system/lib64/libart.so (_ZN3art10ThreadList14DumpForSigQuitERNSt3__113basic_ostreamIcNS1_11char_traitsIcEEEE+804)
  native: #07 pc 00000000004364bc  /system/lib64/libart.so (_ZN3art7Runtime14DumpForSigQuitERNSt3__113basic_ostreamIcNS1_11char_traitsIcEEEE+344)
  native: #08 pc 000000000043cb74  /system/lib64/libart.so (_ZN3art13SignalCatcher13HandleSigQuitEv+2240)
  native: #09 pc 000000000043b69c  /system/lib64/libart.so (_ZN3art13SignalCatcher3RunEPv+476)
  native: #10 pc 00000000000681a4  /system/lib64/libc.so (_ZL15__pthread_startPv+196)
  native: #11 pc 000000000001db80  /system/lib64/libc.so (__start_thread+16)
  (no managed stack frames)

The rest of this article mainly walks through how the thread backtrace is printed.

 

5.Threadlist DumpForSigQuit

Let's look at how ThreadList performs the dump:

 void ThreadList::DumpForSigQuit(std::ostream& os) {
  {
    ScopedObjectAccess soa(Thread::Current());
    // Only print if we have samples.
    if (suspend_all_historam_.SampleSize() > 0) { // records how long each SuspendAll took; dumped only if there are samples
      Histogram<uint64_t>::CumulativeData data;
      suspend_all_historam_.CreateHistogram(&data);
      suspend_all_historam_.PrintConfidenceIntervals(os, 0.99, data);  // Dump time to suspend.
    }
  }
  Dump(os); // dump the thread list
  DumpUnattachedThreads(os); // dump the threads of this process that are not attached to the runtime
}
 
void ThreadList::Dump(std::ostream& os) {
  {
    MutexLock mu(Thread::Current(), *Locks::thread_list_lock_);
    os << "DALVIK THREADS (" << list_.size() << "):\n";
  }
  DumpCheckpoint checkpoint(&os); // create the checkpoint closure
  size_t threads_running_checkpoint = RunCheckpoint(&checkpoint); // run the checkpoint to dump the threads
  if (threads_running_checkpoint != 0) {
    checkpoint.WaitForThreadsToRunThroughCheckpoint(threads_running_checkpoint); // wait until all threads have run the checkpoint; the thread count is passed in
  }
}
 
class DumpCheckpoint FINAL : public Closure {
 public:
  explicit DumpCheckpoint(std::ostream* os)
      : os_(os), barrier_(0), backtrace_map_(BacktraceMap::Create(getpid())) {}
  void Run(Thread* thread) OVERRIDE {
    // Note thread and self may not be equal if thread was already suspended at the point of the
    // request.
    Thread* self = Thread::Current();
    std::ostringstream local_os;
    {
      ScopedObjectAccess soa(self);
      thread->Dump(local_os, backtrace_map_.get()); // the actual thread dump happens here, so every thread is dumped through DumpCheckpoint::Run
    }
    local_os << "\n";
    {
      // Use the logging lock to ensure serialization when writing to the common ostream.
      MutexLock mu(self, *Locks::logging_lock_);
      *os_ << local_os.str();
    }
    barrier_.Pass(self); // once its dump is done, each thread decrements the barrier's count by one; a count of 0 means every thread has finished dumping
  }
  void WaitForThreadsToRunThroughCheckpoint(size_t threads_running_checkpoint) {
    Thread* self = Thread::Current();
    ScopedThreadStateChange tsc(self, kWaitingForCheckPointsToRun);
    bool timed_out = barrier_.Increment(self, threads_running_checkpoint, kDumpWaitTimeout); // add the number of threads that must dump (passed in by the caller) to the barrier count, then wait with a timeout
    if (timed_out) { // a timeout means some thread dumps have not finished; the barrier count should still be greater than 0
      // Avoid a recursive abort.
      LOG((kIsDebugBuild && (gAborting == 0)) ? FATAL : ERROR)
          << "Unexpected time out during dump checkpoint.";
    }
  }
 private:
  // The common stream that will accumulate all the dumps.
  std::ostream* const os_;
  // The barrier to be passed through and for the requestor to wait upon.
  Barrier barrier_;
  // A backtrace map, so that all threads use a shared info and don't reacquire/parse separately.
  std::unique_ptr<BacktraceMap> backtrace_map_;
};

In other words, the thread list is dumped by having each thread run the DumpCheckpoint, which dumps that thread's state and backtrace.
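The Pass/Increment bookkeeping is easier to follow in isolation. Below is a minimal sketch of a barrier with the same two operations; ToyBarrier is a hypothetical class built on std::mutex and std::condition_variable, not ART's Barrier. In this sketch the count may briefly go negative when threads pass before the requester calls Increment.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical toy barrier with the Pass/Increment shape used by DumpCheckpoint.
class ToyBarrier {
 public:
  explicit ToyBarrier(int count) : count_(count) {}

  // Called by each thread once its dump is finished: the count drops by one.
  void Pass() {
    std::lock_guard<std::mutex> lock(mu_);
    --count_;
    if (count_ == 0) cond_.notify_all();
  }

  // Called by the requester: adds the number of expected passes, then waits
  // until the count is back to zero. Returns true if the wait timed out.
  bool Increment(int delta, std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mu_);
    count_ += delta;
    return !cond_.wait_for(lock, timeout, [this] { return count_ == 0; });
  }

  int count() {
    std::lock_guard<std::mutex> lock(mu_);
    return count_;
  }

 private:
  std::mutex mu_;
  std::condition_variable cond_;
  int count_;
};
```

If two threads call Pass() before the requester calls Increment(2, ...), the count goes to -2 and then back to 0, so the requester does not block. If nobody passes, Increment returns true after the timeout and the count stays above zero, which matches the timed_out branch in the code above.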

Now let's see how each thread gets to run the DumpCheckpoint:

 

size_t ThreadList::RunCheckpoint(Closure* checkpoint_function) {
  Thread* self = Thread::Current();
  Locks::mutator_lock_->AssertNotExclusiveHeld(self);
  Locks::thread_list_lock_->AssertNotHeld(self);
  Locks::thread_suspend_count_lock_->AssertNotHeld(self);
  if (kDebugLocking && gAborting == 0) {
    CHECK_NE(self->GetState(), kRunnable);
  }
  std::vector<Thread*> suspended_count_modified_threads;
  size_t count = 0;
  {
    // Step 1: treat Runnable threads and Suspended threads differently.
    // Call a checkpoint function for each thread, threads which are suspend get their checkpoint
    // manually called. (That is: every thread runs the checkpoint function, and for suspended threads we call it on their behalf.)
    MutexLock mu(self, *Locks::thread_list_lock_);
    MutexLock mu2(self, *Locks::thread_suspend_count_lock_);
    count = list_.size();
    for (const auto& thread : list_) {
      if (thread != self) {
        while (true) {
          // For a Runnable thread, add checkpoint_function to the thread's list of checkpoint functions; the thread runs it when it reaches a checkpoint.
          if (thread->RequestCheckpoint(checkpoint_function)) {
            // This thread will run its checkpoint some time in the near future.
            break;
          } else {
            // We are probably suspended, try to make sure that we stay suspended.
            // The thread switched back to runnable.
            if (thread->GetState() == kRunnable) {
              // Spurious fail, try again.
              continue;
            }
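            // Suspended threads are collected in a vector and handled separately below. To keep the thread's state from changing
            // while we work on it, raise its suspend count by 1: even if its original suspend request ends, the count stays non-zero and it cannot become Runnable.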
            thread->ModifySuspendCount(self, +1, false);
            suspended_count_modified_threads.push_back(thread);
            break;
          }
        }
      }
    }
  }
  // Run the checkpoint on ourself while we wait for threads to suspend.
  checkpoint_function->Run(self); // the Signal Catcher thread runs the checkpoint function on itself here, dumping its own thread
  // Run the checkpoint on the suspended threads.
  for (const auto& thread : suspended_count_modified_threads) {
    if (!thread->IsSuspended()) {
      if (ATRACE_ENABLED()) {
        std::ostringstream oss;
        thread->ShortDump(oss);
        ATRACE_BEGIN((std::string("Waiting for suspension of thread ") + oss.str()).c_str());
      }
      // Busy wait until the thread is suspended.
      const uint64_t start_time = NanoTime();
      do {
        ThreadSuspendSleep(kThreadSuspendInitialSleepUs);
      } while (!thread->IsSuspended());
      const uint64_t total_delay = NanoTime() - start_time;
      // Shouldn't need to wait for longer than 1000 microseconds.
      constexpr uint64_t kLongWaitThreshold = MsToNs(1);
      ATRACE_END();
      if (UNLIKELY(total_delay > kLongWaitThreshold)) {
        LOG(WARNING) << "Long wait of " << PrettyDuration(total_delay) << " for "
            << *thread << " suspension!";
      }
    }
    // We know for sure that the thread is suspended at this point.
    checkpoint_function->Run(thread); // the suspended threads collected in step 1 cannot run, so we call the checkpoint function's Run for each of them, passing the thread to be dumped
    {
      MutexLock mu2(self, *Locks::thread_suspend_count_lock_);
      thread->ModifySuspendCount(self, -1, false); // this thread's dump is done; lower its suspend count by 1, since it no longer needs to stay suspended
    }
  }
  {
    // Imitate ResumeAll, threads may be waiting on Thread::resume_cond_ since we raised their
    // suspend count. Now the suspend_count_ is lowered so we must do the broadcast.
    MutexLock mu2(self, *Locks::thread_suspend_count_lock_);
    Thread::resume_cond_->Broadcast(self); // notify the suspended threads that they may resume
  }
  return count;
}

 

Two points here need further explanation:

1.The kRunnable and kSuspended thread states:

enum ThreadState {
  //                                   Thread.State   JDWP state
  kTerminated = 66,                 // TERMINATED     TS_ZOMBIE    Thread.run has returned, but Thread* still around
  kRunnable,                        // RUNNABLE       TS_RUNNING   runnable
  kTimedWaiting,                    // TIMED_WAITING  TS_WAIT      in Object.wait() with a timeout
  kSleeping,                        // TIMED_WAITING  TS_SLEEPING  in Thread.sleep()
  kBlocked,                         // BLOCKED        TS_MONITOR   blocked on a monitor
  kWaiting,                         // WAITING        TS_WAIT      in Object.wait()
  kWaitingForGcToComplete,          // WAITING        TS_WAIT      blocked waiting for GC
  kWaitingForCheckPointsToRun,      // WAITING        TS_WAIT      GC waiting for checkpoints to run
  kWaitingPerformingGc,             // WAITING        TS_WAIT      performing GC
  kWaitingForDebuggerSend,          // WAITING        TS_WAIT      blocked waiting for events to be sent
  kWaitingForDebuggerToAttach,      // WAITING        TS_WAIT      blocked waiting for debugger to attach
  kWaitingInMainDebuggerLoop,       // WAITING        TS_WAIT      blocking/reading/processing debugger events
  kWaitingForDebuggerSuspension,    // WAITING        TS_WAIT      waiting for debugger suspend all
  kWaitingForJniOnLoad,             // WAITING        TS_WAIT      waiting for execution of dlopen and JNI on load code
  kWaitingForSignalCatcherOutput,   // WAITING        TS_WAIT      waiting for signal catcher IO to complete
  kWaitingInMainSignalCatcherLoop,  // WAITING        TS_WAIT      blocking/reading/processing signals
  kWaitingForDeoptimization,        // WAITING        TS_WAIT      waiting for deoptimization suspend all
  kWaitingForMethodTracingStart,    // WAITING        TS_WAIT      waiting for method tracing to start
  kWaitingForVisitObjects,          // WAITING        TS_WAIT      waiting for visiting objects
  kWaitingForGetObjectsAllocated,   // WAITING        TS_WAIT      waiting for getting the number of allocated objects
  kStarting,                        // NEW            TS_WAIT      native thread started, not yet ready to run managed code
  kNative,                          // RUNNABLE       TS_RUNNING   running in a JNI native method
  kSuspended,                       // RUNNABLE       TS_RUNNING   suspended by GC or debugger
};

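Of these, three states describe a thread that is running:

kRunnable, // running; may allocate on the heap and jump between Java methods

kNative,  // executing a JNI native method; does not affect Java heap allocation or GC, and makes no Java method jumps

kSuspended, // the thread is actually waiting on the resume condition on its way to Runnable

kRunnable means the thread is currently running.

kSuspended arises when a thread is about to switch from another state to kRunnable and checks whether it has a kSuspendRequest.

If there is a suspend request, the thread waits and stops executing code; it becomes kSuspended and switches to Runnable only after its suspend count drops back to 0.

This is also why GC needs to SuspendAll threads: once they are suspended the heap is effectively locked, with no thread operating on the Java heap, so the GC thread can work on it.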

 

2.CheckPoint

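Any discussion of checkpoints has to mention safepoints.

safepoint: code compiled by ART periodically polls the runtime to find out whether it needs to run some specific code; the points where this polling happens can be regarded as safepoints.

A safepoint can be used to pause a Java thread, and it can also be used to implement the checkpoint mechanism.

For example, when thread A, which is executing Java code, reaches a safepoint, it runs the CheckSuspend function. If it finds that the thread has a checkpoint request,

it runs the thread's checkpoint function at that point; if it finds a suspend request, it performs the suspend check, which puts the thread into the Suspended state (paused).

So ART's checkpoint can be regarded as one feature built on safepoints.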
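The polling idea can be sketched in a few lines. This is a single-threaded model under stated assumptions, not ART code: PollingThread and SumWithSafepoints are hypothetical names, and the loop stands in for compiled Java code that polls on every back-edge.

```cpp
#include <atomic>
#include <functional>

// Hypothetical per-thread state: a flag the requester sets, plus the closure to run.
struct PollingThread {
  std::atomic<bool> checkpoint_request{false};
  std::function<void()> checkpoint;
  int checkpoints_run = 0;

  // What a safepoint poll does: a cheap flag test, with the slow path taken
  // only when a request is pending.
  void CheckSuspend() {
    if (checkpoint_request.exchange(false)) {
      checkpoint();
      ++checkpoints_run;
    }
  }
};

// Stand-in for "managed" code: a loop with a safepoint poll on every back-edge.
long SumWithSafepoints(PollingThread& self, int n) {
  long sum = 0;
  for (int i = 0; i < n; ++i) {
    sum += i;
    self.CheckSuspend();  // back-edge poll
  }
  return sum;
}
```

If the request flag is set before the loop starts, the closure runs at the first back-edge and exactly once, and the loop result is unaffected. That is the sense in which a checkpoint rides on top of the safepoint poll.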

 

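The following passage is quoted from the web:

Author: RednaxelaFX
Link: https://www.zhihu.com/question/48996839/answer/113801448
Source: Zhihu
Copyright belongs to the author. For commercial reproduction, contact the author for authorization; for non-commercial reproduction, credit the source.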

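From the point of view of the compiler and the interpreter, ART has two kinds of safepoints:
  • Active safepoints: the compiled code or the interpreted code actively checks for a safepoint and, when it finds that it needs to enter one, jumps to the corresponding handler.
    • The ART interpreter places active safepoints at loop back-edges (specifically at the source of the jump, before jumping) and at method exits (return / throw exception).
    • The ART Optimizing Compiler places active safepoints at loop back-edges (specifically at the source of the jump, before jumping) and at method entry.
  • Passive safepoints: every non-inlined method call site is a passive safepoint. No code has to be executed actively here; it is just an ordinary method call.
    • It must count as a safepoint because, once execution reaches the call site, control passes to the callee; the callee may enter a safepoint, and a safepoint may need to walk the stack frames, so the caller must be at a safepoint too.

The idea behind safepoint placement is that, after the runtime requests a safepoint, the program must be able to reach the nearest safepoint promptly and hand control to the runtime.
What counts as "promptly"? It is enough for the execution time to be bounded; the real-time requirement is not very strict.

It is then further assumed that forward-executing code (straight-line code and code with conditional branches alike) finishes in finite time and can be ignored, and that code that may run for a long time is either a loop or a method call, so inserting safepoints in these two places is enough to guarantee promptness.
Whether the safepoint goes at method entry or exit, or at the source or target of a loop back-edge, is an implementation detail; it is enough to pick one side.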
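So, for the line of code shown earlier:
          // For a Runnable thread, add checkpoint_function to the thread's list of checkpoint functions; the thread runs it when it reaches a checkpoint.
          if (thread->RequestCheckpoint(checkpoint_function)) {

For a Runnable thread we have set the checkpoint_function and the checkpoint request, so the thread is bound to reach a checkpoint and run the checkpoint function.

As mentioned above, safepoints have loose real-time requirements. To give a sense of scale, a thread is certain to reach a checkpoint within the running time of one function.

Other factors also play a part, thread scheduling for example: suppose thread A is Runnable and about to reach a safepoint, but is no longer scheduled; it will then never get to the safepoint.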

 

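In this example, the normal flow is: when a Runnable thread reaches a safepoint, it finds the checkpoint request and runs the checkpoint function.

Here the checkpoint function has been set to DumpCheckpoint's Run() function, which performs the thread dump.

At this point, the call sites of the dump for both Suspended and Runnable threads have been covered.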
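The two call sites can be put side by side in a small single-threaded model. This is a sketch with hypothetical names (ModelState, ModelThread, RunCheckpointModel); it leaves out all locking and the busy-wait for suspension and keeps only the decision "requested on the thread" versus "run by the requester".

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

enum class ModelState { kRunnable, kSuspended };

// Hypothetical model of a thread as seen by RunCheckpoint.
struct ModelThread {
  ModelThread(std::string n, ModelState s) : name(std::move(n)), state(s) {}

  // Only a runnable thread accepts a request; it runs it at its next safepoint.
  bool RequestCheckpoint(std::function<void(ModelThread&)> fn) {
    if (state != ModelState::kRunnable) return false;
    pending = std::move(fn);
    return true;
  }

  // What a runnable thread does when it reaches its next safepoint.
  void ReachSafepoint() {
    if (!pending) return;
    std::function<void(ModelThread&)> fn = std::move(pending);
    pending = nullptr;
    fn(*this);
  }

  std::string name;
  ModelState state;
  int suspend_count = 0;
  std::function<void(ModelThread&)> pending;
};

// Returns the names of the threads whose checkpoint was run by the requester.
std::vector<std::string> RunCheckpointModel(
    std::vector<ModelThread>& threads,
    const std::function<void(ModelThread&)>& fn) {
  std::vector<ModelThread*> suspended;
  for (ModelThread& t : threads) {
    if (!t.RequestCheckpoint(fn)) {
      ++t.suspend_count;  // keep the thread suspended while we work on it
      suspended.push_back(&t);
    }
  }
  std::vector<std::string> run_by_requester;
  for (ModelThread* t : suspended) {
    fn(*t);              // the requester runs the closure on the thread's behalf
    --t->suspend_count;  // the thread may resume again
    run_by_requester.push_back(t->name);
  }
  return run_by_requester;
}
```

With one runnable and one suspended thread, the suspended thread is dumped immediately by the requester, while the runnable thread is dumped later, when it calls ReachSafepoint().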

 

6.Dump thread

6.1 First, a closer look at the dump of the thread information:

void Thread::Dump(std::ostream& os, BacktraceMap* backtrace_map) const {
  DumpState(os); // dump the thread's state information
  DumpStack(os, backtrace_map); // dump the thread's kernel/native/Java stacks
}

An example of the thread state information:

"Signal Catcher" daemon prio=5 tid=3 Runnable
  | group="system" sCount=0 dsCount=0 obj=0x32c050d0 self=0x7f97dd1400
  | sysTid=9729 nice=0 cgrp=default sched=0/0 handle=0x7fa200e450
  | state=R schedstat=( 217991249 1074429 82 ) utm=15 stm=6 core=4 HZ=100
  | stack=0x7fa1f14000-0x7fa1f16000 stackSize=1005KB
  | held mutexes= "mutator lock"(shared held)

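Line 1: "Signal Catcher" is the thread name; daemon shows whether it is a daemon thread (if not, "daemon" is not printed); prio=5 is the priority from the java.lang.Thread object; tid=3 is the thread id inside the VM; Runnable is the thread's state in the VM. (If the thread is not attached, the first line reads: "name" prio=num (not attached).)

Line 2: group is the ThreadGroup; sCount is the suspend count; dsCount is the debugger suspend count (less than or equal to sCount); obj is the corresponding java.lang.Thread object; self is the native Thread pointer.

Line 3: sysTid is the Linux thread tid; nice is the scheduling priority; cgrp is the cgroup (CPU scheduling group); sched is the scheduling policy and scheduling priority; handle is the pthread_t of the thread.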

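nice:

The thread's scheduling priority, obtained with getpriority. It ranges from -20 to 19; the smaller the value, the higher the priority. A value of -1 can also mean that reading the priority failed.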
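Because -1 is both a legal nice value and the error return of getpriority, the two cases can only be told apart through errno. A minimal sketch for Linux follows; ReadNice is a hypothetical helper name, and the thread id is obtained through the raw gettid system call.

```cpp
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>

#include <cerrno>

// Hypothetical helper: reads the nice value of the calling thread.
// On Linux, PRIO_PROCESS combined with a thread id addresses that one thread.
// Returns true on success and stores the value in *nice_out.
bool ReadNice(int* nice_out) {
  pid_t tid = static_cast<pid_t>(syscall(SYS_gettid));
  errno = 0;  // -1 is a legal nice value, so errno is the only reliable error signal
  int value = getpriority(PRIO_PROCESS, static_cast<id_t>(tid));
  if (value == -1 && errno != 0) return false;
  *nice_out = value;
  return true;
}
```

A trace that shows nice=-1 is therefore ambiguous on its own: the thread may really run at nice -1, or the lookup may have failed.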

 

cgrp:

Read from /proc/self/task/%d/cgroup, for example:

5:freezer:/
4:cpuset:/background
3:cpu:/bg_non_interactive
2:memory:/
1:cpuacct:/uid_10024/pid_6850

cgrp=bg_non_interactive
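The mapping from the cgroup file to the printed name can be sketched as a small parser. This is an illustration, not ART's implementation: CpuCgroupName is a hypothetical function that returns the path of the line whose controller list contains exactly "cpu", without the leading slash.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: given the text of /proc/self/task/<tid>/cgroup, returns
// the group of the "cpu" controller without its leading slash, or "" if no
// line lists that controller.
std::string CpuCgroupName(const std::string& cgroup_text) {
  std::istringstream lines(cgroup_text);
  std::string line;
  while (std::getline(lines, line)) {
    // Each line looks like "<id>:<controller[,controller...]>:<path>".
    std::size_t first = line.find(':');
    if (first == std::string::npos) continue;
    std::size_t second = line.find(':', first + 1);
    if (second == std::string::npos) continue;
    std::istringstream controllers(line.substr(first + 1, second - first - 1));
    std::string controller;
    while (std::getline(controllers, controller, ',')) {
      if (controller == "cpu") {
        std::string path = line.substr(second + 1);
        return (!path.empty() && path[0] == '/') ? path.substr(1) : path;
      }
    }
  }
  return "";
}
```

Applied to the five sample lines above, the function skips cpuset and cpuacct, matches the "3:cpu:/bg_non_interactive" line, and returns bg_non_interactive.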

 

sched scheduling policy:

#define SCHED_NORMAL            0

#define SCHED_OTHER             0

#define SCHED_FIFO              1

#define SCHED_RR                2

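 SCHED_OTHER
  The default time-sharing scheduling policy. All threads have priority 0 and are scheduled by time slicing. Put simply, under this policy a program cannot set thread priorities. Note that this policy is also preemptive: when a higher-priority thread becomes ready to run, the current thread is preempted and put on the wait queue. The policy only decides the run order of threads with the same priority in the runnable queue.
  SCHED_FIFO
  A real-time first-in, first-out scheduling policy that can only be used by the superuser. It applies only to threads with a priority greater than 0, which means that a runnable SCHED_FIFO thread always preempts a running SCHED_OTHER thread. SCHED_FIFO is a simple policy without time slicing: when a thread becomes runnable, it is appended to the tail of the queue for its priority (POSIX 1003.1), and it runs once all higher-priority threads have terminated or blocked. Threads with the same priority run in simple first-in, first-run order. Consider a bad case: several threads of the same priority are waiting, but the one that started first never terminates or blocks. The others cannot run unless the current thread calls something like pthread_yield, so take care with same-priority threads when using SCHED_FIFO.
  SCHED_RR
  SCHED_RR adds some enhancements to address the shortcomings of SCHED_FIFO. In essence it is still the SCHED_FIFO policy, but it limits how long the current thread may run with a maximum time quantum: once the running time reaches the quantum, the thread is switched out and placed at the end of the queue for its priority. The benefit is that other threads of the same priority still get to run alongside a "selfish" thread. When setting a policy, a return value of 0 indicates success and any other value indicates failure.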
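The "policy/priority" pair can be read back for the calling thread with the standard Linux scheduling calls. The sketch below is only an illustration of where such numbers come from; PolicyName and DescribeCallerScheduling are hypothetical helpers, and the sketch uses sched_getscheduler and sched_getparam, which may differ from the call ART itself uses.

```cpp
#include <sched.h>

#include <string>

// Hypothetical helper: maps the policy constants listed above to their names.
const char* PolicyName(int policy) {
  switch (policy) {
    case SCHED_OTHER: return "SCHED_OTHER";
    case SCHED_FIFO:  return "SCHED_FIFO";
    case SCHED_RR:    return "SCHED_RR";
    default:          return "other";
  }
}

// Hypothetical helper: builds a "policy/priority" string for the caller,
// or "unknown" if either lookup fails.
std::string DescribeCallerScheduling() {
  int policy = sched_getscheduler(0);  // 0 means "the caller"
  sched_param param{};
  if (policy < 0 || sched_getparam(0, &param) != 0) return "unknown";
  return std::to_string(policy) + "/" + std::to_string(param.sched_priority);
}
```

For an ordinary time-shared thread this typically yields the same "0/0" shape as sched=0/0 in the sample trace, though the exact values depend on how the thread was configured.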


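Line 4: state is the Linux thread state; schedstat is the thread's scheduling statistics; utm=15 is the time the thread has run in user mode; stm=6 is the time it has run in kernel mode; core=4 is the CPU the thread last ran on; HZ=100 is the system clock frequency.

state=R is the task state. R: running, S: sleeping (TASK_INTERRUPTIBLE), D: disk sleep (TASK_UNINTERRUPTIBLE), T: stopped, t: tracing stop, Z: zombie, X: dead

schedstat: cat /proc/self/task/%d/schedstat

schedstat=( 217991249 1074429 82 ) means: (cumulative time spent running on the CPU, in ns; cumulative time spent waiting in the run queue, in ns; number of timeslices run on the CPU, which roughly tracks how many times the thread was switched in)

state, utm, stm, and so on are read from /proc/self/task/%d/stat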

 * struct task_cputime - collected CPU time counts

 * @utime:        time spent in user mode, in &cputime_t units

 * @stime:        time spent in kernel mode, in &cputime_t units

 * @sum_exec_runtime:    total time spent on the CPU, in nanoseconds

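utm and stm are measured in jiffies, that is, clock ticks.

Frequency is the reciprocal of the period, usually the number of timer interrupts per second, so with HZ=100 one tick lasts 1/100 s = 0.01 s = 10 ms: an interrupt fires every 10 ms.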
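The arithmetic for line 4 can be written down directly. The following sketch uses hypothetical names (SchedStat, ParseSchedStat, TicksToMs); it parses the three schedstat numbers and converts tick counts such as utm and stm to milliseconds for a given HZ.

```cpp
#include <cstdio>
#include <string>

// Hypothetical holder for the three schedstat fields.
struct SchedStat {
  unsigned long long run_ns;      // time spent running on the CPU
  unsigned long long wait_ns;     // time spent waiting in the run queue
  unsigned long long timeslices;  // number of timeslices run
};

// Parses text such as "217991249 1074429 82"; returns false on malformed input.
bool ParseSchedStat(const std::string& text, SchedStat* out) {
  return std::sscanf(text.c_str(), "%llu %llu %llu",
                     &out->run_ns, &out->wait_ns, &out->timeslices) == 3;
}

// Converts a tick count such as utm or stm to milliseconds for a given HZ.
unsigned long long TicksToMs(unsigned long long ticks, unsigned hz) {
  return ticks * 1000ULL / hz;
}
```

For the sample thread, utm=15 and stm=6 at HZ=100 correspond to 150 ms in user mode and 60 ms in kernel mode, which is consistent in magnitude with the roughly 218 ms of CPU time reported by the first schedstat field.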

 

Line 5: stack=0x7fa1f14000-0x7fa1f16000 stackSize=1005KB

The start and end of the thread stack, and the stack size.

 

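Line 6: held mutexes= "mutator lock"(shared held)

The names of the VM mutexes the thread holds and how it holds them: shared held means a shared lock, exclusive held means an exclusive lock.

Every thread releases the "mutator lock" when it finishes suspending.

In fact, when all threads are being suspended, completion is detected by acquiring the "mutator lock" exclusively:

if the exclusive lock can be acquired, no other thread holds the "mutator lock" in either exclusive or shared mode, so every thread has finished suspending.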
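The reasoning "exclusive acquisition succeeds only when no shared holder is left" can be captured in a tiny bookkeeping model. This is a deliberately simplified, single-threaded sketch: MutatorLockModel is hypothetical, does no real synchronization, and only tracks who would hold the lock.

```cpp
// Hypothetical single-threaded model of a reader/writer lock. It performs no
// real locking; it only records holders so the suspend logic can be followed.
class MutatorLockModel {
 public:
  // A runnable thread holds the lock in shared mode.
  void SharedLock() { ++shared_holders_; }

  // A thread gives up its shared hold when it finishes suspending.
  void SharedUnlock() { --shared_holders_; }

  // Succeeds only when nobody holds the lock in any mode.
  bool TryExclusiveLock() {
    if (shared_holders_ > 0 || exclusive_held_) return false;
    exclusive_held_ = true;
    return true;
  }

  void ExclusiveUnlock() { exclusive_held_ = false; }

 private:
  int shared_holders_ = 0;
  bool exclusive_held_ = false;
};
```

With two runnable threads holding the lock in shared mode, TryExclusiveLock() keeps failing until both have released it, which is the condition SuspendAll waits for in the description above.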

 

6.2 Next, the thread backtrace dump:

DumpKernelStack:

It reads the kernel stack from /proc/self/task/%d/stack and strips the addresses.

gemini:/ # cat /proc/10749/task/10749/stack

[<0000000000000000>] __switch_to+0x70/0x7c

[<0000000000000000>] SyS_epoll_wait+0x2ac/0x370

[<0000000000000000>] SyS_epoll_pwait+0xa4/0x118

[<0000000000000000>] el0_svc_naked+0x24/0x28

[<0000000000000000>] 0xffffffffffffffff
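Stripping the address from one such line is a short string operation. The sketch below shows one way to do it; StripKernelAddress is a hypothetical helper, not the ART function, and it simply drops everything up to and including the closing bracket plus the following spaces.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: drops the leading "[<address>] " from one line of
// /proc/<pid>/task/<tid>/stack. Lines without a bracket are returned unchanged.
std::string StripKernelAddress(const std::string& frame) {
  std::size_t close = frame.find(']');
  if (close == std::string::npos) return frame;
  std::size_t start = frame.find_first_not_of(' ', close + 1);
  return start == std::string::npos ? "" : frame.substr(start);
}
```

Applied to the first sample line, the result is "__switch_to+0x70/0x7c", that is, the symbol with its offset and size but without the address column.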

 

DumpNativeStack:

It uses Backtrace->Unwind to obtain the backtrace and prints the pc offset and method name of each frame:

  native: #00 pc 000000000001beec  /system/lib64/libc.so (syscall+28)

  native: #01 pc 00000000000e6dd4  /system/lib64/libart.so (_ZN3art17ConditionVariable16WaitHoldingLocksEPNS_6ThreadE+160)

  native: #02 pc 000000000031a354  /system/lib64/libart.so (_ZN3art12ProfileSaver3RunEv+296)

  native: #03 pc 000000000031ba6c  /system/lib64/libart.so (_ZN3art12ProfileSaver21RunProfileSaverThreadEPv+100)

  native: #04 pc 00000000000681a4  /system/lib64/libc.so (_ZL15__pthread_startPv+196)

  native: #05 pc 000000000001db80  /system/lib64/libc.so (__start_thread+16)
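The method names in native frames are mangled C++ symbols followed by a "+offset" suffix. When reading a trace offline they can be made readable with the demangler that ships with GCC and Clang. This is a reading aid, not part of the dump itself; DemangleFrameSymbol is a hypothetical helper built on abi::__cxa_demangle.

```cpp
#include <cxxabi.h>

#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper: demangles "symbol+offset" as printed in a native frame.
// The "+offset" suffix is kept; names that are not Itanium-mangled (no "_Z"
// prefix, for example plain C functions) are returned unchanged.
std::string DemangleFrameSymbol(const std::string& symbol_with_offset) {
  std::size_t plus = symbol_with_offset.rfind('+');
  std::string symbol = symbol_with_offset.substr(0, plus);
  std::string offset =
      plus == std::string::npos ? "" : symbol_with_offset.substr(plus);
  if (symbol.compare(0, 2, "_Z") != 0) return symbol_with_offset;

  int status = 0;
  char* demangled = abi::__cxa_demangle(symbol.c_str(), nullptr, nullptr, &status);
  if (status != 0 || demangled == nullptr) return symbol_with_offset;
  std::string result = std::string(demangled) + offset;
  std::free(demangled);  // __cxa_demangle allocates the result with malloc
  return result;
}
```

Frame #02 in the sample, _ZN3art12ProfileSaver3RunEv+296, then reads as art::ProfileSaver::Run()+296, while syscall+28 is left as it is.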

 

 DumpJavaStack:

It uses a StackVisitor to dump the Java frames.

 

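Summary:

1.Thread information is dumped through checkpoints

2.The checkpoint function is invoked differently for kRunnable and kSuspended threads

3.A Barrier tracks whether the threads have finished running the checkpoint function; count is the number of threads that have not finished yet

4.How to read the thread state information and how the backtrace is obtained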
