android java process stack OOM

Introduction

I recently ran into an interesting problem: during a stress test on a customer platform, the launcher restarted with the following error log:

11-30 04:19:24.843 W/libc    ( 4819): pthread_create failed: clone failed: Try again
11-30 04:19:24.844 E/art     ( 4819): Throwing OutOfMemoryError "pthread_create (1040KB stack) failed: Try again"
11-30 04:19:24.850 E/AndroidRuntime( 4819): FATAL EXCEPTION: AsyncTask #2
11-30 04:19:24.850 E/AndroidRuntime( 4819): Process: com.****.launcher, PID: 4819
11-30 04:19:24.850 E/AndroidRuntime( 4819): java.lang.OutOfMemoryError: pthread_create (1040KB stack) failed: Try again
11-30 04:19:24.850 E/AndroidRuntime( 4819):     at java.lang.Thread.nativeCreate(Native Method) 
11-30 04:19:24.850 E/AndroidRuntime( 4819):     at java.lang.Thread.start(Thread.java:1063)
11-30 04:19:24.850 E/AndroidRuntime( 4819):     at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:920)
11-30 04:19:24.850 E/AndroidRuntime( 4819):     at java.util.concurrent.ThreadPoolExecutor.processWorkerExit(ThreadPoolExecutor.java:988)
11-30 04:19:24.850 E/AndroidRuntime( 4819):     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
11-30 04:19:24.850 E/AndroidRuntime( 4819):     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:587)
11-30 04:19:24.850 E/AndroidRuntime( 4819):     at java.lang.Thread.run(Thread.java:818)

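Is it really out of memory? The exception reported is OutOfMemoryError, so the first instinct is to suspect a memory leak. However, the memory usage dumped during the stress test showed nothing abnormal.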

I then noticed that the errno is EAGAIN, not ENOMEM. The failure happens in pthread_create; the ART code at the point of failure is as follows:

art/runtime/thread.cc

void Thread::CreateNativeThread(JNIEnv* env, jobject java_peer, size_t stack_size, bool is_daemon) {
  CHECK(java_peer != nullptr);
  Thread* self = static_cast<JNIEnvExt*>(env)->self;
  Runtime* runtime = Runtime::Current();

  // Atomically start the birth of the thread ensuring the runtime isn't shutting down.
  bool thread_start_during_shutdown = false;
  {
    MutexLock mu(self, *Locks::runtime_shutdown_lock_);
    if (runtime->IsShuttingDownLocked()) {
      thread_start_during_shutdown = true;
    } else {
      runtime->StartThreadBirth();
    }
  }
  if (thread_start_during_shutdown) {
    ScopedLocalRef<jclass> error_class(env, env->FindClass("java/lang/InternalError"));
    env->ThrowNew(error_class.get(), "Thread starting during runtime shutdown");
    return;
  }

  Thread* child_thread = new Thread(is_daemon);
  // Use global JNI ref to hold peer live while child thread starts.
  child_thread->tlsPtr_.jpeer = env->NewGlobalRef(java_peer);
  stack_size = FixStackSize(stack_size);

  // Thread.start is synchronized, so we know that nativePeer is 0, and know that we're not racing to
  // assign it.
  env->SetLongField(java_peer, WellKnownClasses::java_lang_Thread_nativePeer,
                    reinterpret_cast<jlong>(child_thread));

  pthread_t new_pthread;
  pthread_attr_t attr;
  CHECK_PTHREAD_CALL(pthread_attr_init, (&attr), "new thread");
  CHECK_PTHREAD_CALL(pthread_attr_setdetachstate, (&attr, PTHREAD_CREATE_DETACHED), "PTHREAD_CREATE_DETACHED");
  CHECK_PTHREAD_CALL(pthread_attr_setstacksize, (&attr, stack_size), stack_size);
  int pthread_create_result = pthread_create(&new_pthread, &attr, Thread::CreateCallback, child_thread);
  CHECK_PTHREAD_CALL(pthread_attr_destroy, (&attr), "new thread");

  if (pthread_create_result != 0) {
    // pthread_create(3) failed, so clean up.
    {
      MutexLock mu(self, *Locks::runtime_shutdown_lock_);
      runtime->EndThreadBirth();
    }
    // Manually delete the global reference since Thread::Init will not have been run.
    env->DeleteGlobalRef(child_thread->tlsPtr_.jpeer);
    child_thread->tlsPtr_.jpeer = nullptr;
    delete child_thread;
    child_thread = nullptr;
    // TODO: remove from thread group?
    env->SetLongField(java_peer, WellKnownClasses::java_lang_Thread_nativePeer, 0);
    {
      std::string msg(StringPrintf("pthread_create (%s stack) failed: %s",
                                   PrettySize(stack_size).c_str(), strerror(pthread_create_result)));
      ScopedObjectAccess soa(env);
      soa.Self()->ThrowOutOfMemoryError(msg.c_str());
    }
  }
}

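The call stack in the log shows that the failure always occurs when pthread_create is called to create a thread. So this is not a heap memory leak; it is more likely caused by insufficient stack memory.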
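
The distinction between the two error codes can be made explicit in code. The following is a minimal sketch rather than ART code; the helper name DescribePthreadCreateError is hypothetical. It maps the return value of pthread_create to a short hint about what ran out.

```cpp
#include <cerrno>
#include <string>

// Hypothetical helper (not part of ART or bionic): turns the return value
// of pthread_create into a short diagnostic hint.
std::string DescribePthreadCreateError(int err) {
  switch (err) {
    case 0:
      return "ok";
    case EAGAIN:
      // The system refused to create another thread: a thread-count limit
      // was hit or the resources for a new thread were unavailable.
      return "EAGAIN: thread limit or thread resources exhausted";
    case ENOMEM:
      return "ENOMEM: not enough memory";
    default:
      return "other error";
  }
}
```

In the log above, libc reports "clone failed: Try again", which is the message text for EAGAIN, so the first branch is the one that applies here.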

android java thread stack size

So what is the default stack size of a Java thread on Android?

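Dalvik kept the Java stack and the native stack separate: by default the Java stack is 32KB and the native stack is 1MB. Under normal circumstances, the Java thread stack size on ART is the same as on Dalvik.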

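Stack space is allocated when a thread is created and reclaimed when the thread exits. A stack of more than 1MB is a very large space. Memory on the stack does not need GC, because it is reclaimed when the function returns.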

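The Java thread stack size on ART can also be worked out from the code. As the CreateNativeThread method above shows, stack_size is computed by the function below. By default the size passed in is 0, which means the default size is used.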

stack_size = FixStackSize(stack_size);

static size_t FixStackSize(size_t stack_size) {
  // A stack size of zero means "use the default".
  if (stack_size == 0) {
    stack_size = Runtime::Current()->GetDefaultStackSize();
  }

  // Dalvik used the bionic pthread default stack size for native threads,
  // so include that here to support apps that expect large native stacks.
  stack_size += 1 * MB;

  // It's not possible to request a stack smaller than the system-defined PTHREAD_STACK_MIN.
  if (stack_size < PTHREAD_STACK_MIN) {
    stack_size = PTHREAD_STACK_MIN;
  }

  if (Runtime::Current()->ExplicitStackOverflowChecks()) {
    // It's likely that callers are trying to ensure they have at least a certain amount of
    // stack space, so we should add our reserved space on top of what they requested, rather
    // than implicitly take it away from them.
    stack_size += GetStackOverflowReservedBytes(kRuntimeISA);
  } else {
    // If we are going to use implicit stack checks, allocate space for the protected
    // region at the bottom of the stack.
    stack_size += Thread::kStackOverflowImplicitCheckSize +
        GetStackOverflowReservedBytes(kRuntimeISA);
  }

  // Some systems require the stack size to be a multiple of the system page size, so round up.
  stack_size = RoundUp(stack_size, kPageSize);

  return stack_size;
}

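Runtime::Current()->GetDefaultStackSize() returns the default stack size, which is set by an option passed when zygote forks the process. On top of that there is an extra 8KB to guard against stack overflow, plus the 1MB for the native stack, giving 24KB + 8KB + 1MB = 1056KB in total. Note that the log above reports a 1040KB stack, that is 1MB + 16KB, so the values on the device in question evidently differ slightly from this estimate.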

frameworks/base/core/jni/AndroidRuntime.cpp:667:    addOption("-XX:mainThreadStackSize=24K");
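
The arithmetic can be reproduced with a simplified model of FixStackSize. This is only a sketch: the function name EstimateStackSize is hypothetical, the reserved size and page size are passed in as parameters instead of being read from the runtime, and the PTHREAD_STACK_MIN check is omitted because the 1MB addition already exceeds it.

```cpp
#include <cstddef>

constexpr std::size_t kKB = 1024;
constexpr std::size_t kMB = 1024 * kKB;

// Rounds value up to the next multiple of `multiple`.
constexpr std::size_t RoundUpTo(std::size_t value, std::size_t multiple) {
  return (value + multiple - 1) / multiple * multiple;
}

// Simplified model of FixStackSize (hypothetical, not ART code):
// Java stack + 1MB native stack + reserved bytes, rounded up to a page.
constexpr std::size_t EstimateStackSize(std::size_t java_stack,
                                        std::size_t reserved,
                                        std::size_t page_size) {
  return RoundUpTo(java_stack + 1 * kMB + reserved, page_size);
}
```

With the numbers used in the text (24KB, 8KB reserved, 4KB pages), EstimateStackSize(24 * kKB, 8 * kKB, 4096) yields 1056KB. The 1040KB seen in the log corresponds to 16KB on top of the 1MB, for example a default of 0 with 16KB reserved; that split is a guess, not a verified configuration.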

how did it happen

Given that 1MB is plenty of space, how did the stack shortage above come about?

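Thread stacks are independent of one another. For Java, the stack for a thread is allocated after start() is called: a native method creates the thread and acquires its resources, and then the thread's run() method is invoked. This makes it easy to see why running out of stack space can surface at thread creation time.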

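The key point is that the operating system's limit on the number of processes is usually fairly large, while the limit on stack memory is comparatively small.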

Note that this is not the process stack size. The process stack size can be checked with ulimit -s or ulimit -a and is usually 8MB.

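A possible cause of the problem above: the launcher process created a thread pool with no upper bound, or with a very large one; in some situations the threads in the pool did not exit after finishing their work, perhaps because of a deadlock; as a result the number of threads kept growing until stack memory was used up.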
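
One way to guard against this kind of unbounded growth is to cap the number of live worker threads. The class below is a minimal sketch with a hypothetical name (ThreadBudget); it only counts slots and does not create threads itself. In Java the usual equivalent is to give ThreadPoolExecutor a bounded maximumPoolSize.

```cpp
#include <atomic>

// Hypothetical guard: caps how many worker threads may be alive at once.
class ThreadBudget {
 public:
  explicit ThreadBudget(int max_threads)
      : max_threads_(max_threads), live_(0) {}

  // Reserves a slot and returns true if another thread may be started.
  bool TryAcquire() {
    int current = live_.load();
    while (current < max_threads_) {
      // On failure compare_exchange_weak reloads `current`, so the loop
      // re-checks the cap against the latest value.
      if (live_.compare_exchange_weak(current, current + 1)) {
        return true;
      }
    }
    return false;
  }

  // Must be called when a thread that acquired a slot exits.
  void Release() { live_.fetch_sub(1); }

  int live() const { return live_.load(); }

 private:
  const int max_threads_;
  std::atomic<int> live_;
};
```

A caller would check TryAcquire() before starting a worker and call Release() when the worker finishes; if TryAcquire() returns false, the task is queued or rejected instead of spawning another thread.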

However, in the subsequent stress test I dumped the launcher process's memory and thread count. Its stack stayed at around 544KB and its thread count at around 80. Both are normal.

** MEMINFO in pid 4802 [com.***.launcher] **
                   Pss  Private  Private  Swapped     Heap     Heap     Heap   
                 Total    Dirty    Clean    Dirty     Size    Alloc     Free   
                ------   ------   ------   ------   ------   ------   ------ 
  Native Heap     6898     6864        0        0     8416     5550     2865   
  Dalvik Heap    27934    27856        0        0    37115    22421    14694  
 Dalvik Other      648      648        0        0       
        Stack      544      544        0        0       
    Other dev       17        0       16        0       
     .so mmap     4780     1468     2464      720       
    .apk mmap      456        0      240        0       
    .dex mmap     7914        0     7840        0       
    .oat mmap     1968        0      300        0       
    .art mmap     1935     1260      464        0       
   Other mmap       36        4        0        0       
    GL mtrack    12304    12304        0        0       
      Unknown      144      144        0        4       
        TOTAL    65578    51092    11324      724    45531    27971    17559  

So why does pthread_create fail?

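I went through the log again and found that more than one process reported a pthread_create failure. After the stress test had been running for a while, many processes exited because clone failed. One process, wallpaperplayer, did not. I dumped its memory information and process status, shown below: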

Applications Memory Usage (kB):
Uptime: 62306777 Realtime: 62306777

** MEMINFO in pid 5266 [com.***.wallpaperplayer:provider] **
                   Pss  Private  Private  Swapped     Heap     Heap     Heap
                 Total    Dirty    Clean    Dirty     Size    Alloc     Free
                ------   ------   ------   ------   ------   ------   ------
  Native Heap    87994    87972        0     1944    90732    81715     9016
  Dalvik Heap    43098    43016        0      612    33056    27605     5451
 Dalvik Other     2088     2088        0       24                           
        Stack     8572     8572        0       24
    Other dev      549        0      548        0                           
     .so mmap     2845      200     2484     1240                           
    .dex mmap     2221        0     1768        0                           
    .oat mmap      506        0       36        0                           
    .art mmap      809      636        0      104                           
   Other mmap       67       56        0        0                           
      Unknown      208      208        0      336                           
        TOTAL   158957   152748     4836     4284   123788   109320    14467

 Objects
               Views:        0         ViewRootImpl:        0
         AppContexts:        3           Activities:        0
              Assets:        6        AssetManagers:        6
       Local Binders:        6        Proxy Binders:       15
       Parcel memory:     1153         Parcel count:     4614
    Death Recipients:        3      OpenSSL Sockets:        0

 SQL
         MEMORY_USED:        0
  PAGECACHE_OVERFLOW:        0          MALLOC_SIZE:        0


cat /proc/5266/status
Name:   player:provider
State:  S (sleeping)
Tgid:   5266
Pid:    5266
PPid:   3906
TracerPid:      0
Uid:    1000    1000    1000    1000
Gid:    1000    1000    1000    1000
Ngid:   0
FDSize: 128
Groups: 1007 1015 1023 1028 2001 3001 3002 3003 9997 41000 
VmPeak:  4054120 kB
VmSize:  4049904 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:     92168 kB
VmRSS:     23060 kB
VmData:  2640412 kB
VmStk:      8196 kB
VmExe:        12 kB
VmLib:     74428 kB
VmPTE:      5488 kB
VmSwap:     3444 kB
Threads:        2333
SigQ:   2/2831
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000001204
SigIgn: 0000000000000000
SigCgt: 00000002000094f8
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000000000000000
Seccomp:        0
Cpus_allowed:   f
Cpus_allowed_list:      0-3
voluntary_ctxt_switches:        11872
nonvoluntary_ctxt_switches:     3423
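
The Threads field in a status dump like the one above can also be extracted programmatically, which is handy when monitoring a long stress test. The following is a minimal sketch; the function name ParseThreadCount is hypothetical, and it takes any input stream so that it can be fed either a file opened from /proc/<pid>/status or a string.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Returns the value of the "Threads:" field from /proc/<pid>/status-style
// text, or -1 if the field is missing or not a number.
int ParseThreadCount(std::istream& status) {
  const std::string key = "Threads:";
  std::string line;
  while (std::getline(status, line)) {
    if (line.compare(0, key.size(), key) == 0) {
      std::istringstream value(line.substr(key.size()));
      int count = 0;
      if (value >> count) {
        return count;
      }
      return -1;
    }
  }
  return -1;
}
```

Sampling this value periodically for each process under test would show a thread leak as a steadily rising count well before clone starts to fail.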

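From the information above, the wallpaperplayer process has 2333 threads, and its stack usage is close to 8MB (VmStk 8196 kB). The system resource limits are shown below: 2831 processes and an 8MB stack.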
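
A rough calculation shows why the thread count is the number to watch. The sketch below assumes that every thread requested the 1040KB stack reported in the launcher's log, which need not hold for threads created natively in this process; the function name StackReservationKB is hypothetical.

```cpp
#include <cstdint>

// Upper-bound estimate, in KB, of the address space reserved for stacks
// if every thread requested the same stack size.
constexpr std::uint64_t StackReservationKB(std::uint64_t threads,
                                           std::uint64_t stack_kb) {
  return threads * stack_kb;
}
```

StackReservationKB(2333, 1040) gives 2426320 KB, which is of the same order as the VmData value of 2640412 kB in the status dump, whereas 80 threads would reserve only 83200 KB.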

time(cpu-seconds)    unlimited
file(blocks)         unlimited
coredump(blocks)     0
data(KiB)            unlimited
stack(KiB)           8192
lockedmem(KiB)       64
nofiles(descriptors) 1024
processes            2831
sigpending           2831
msgqueue(bytes)      819200
maxnice              40
maxrtprio            0
resident-set(KiB)    unlimited
address-space(KiB)   unlimited
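
The limits listed above can also be queried from inside a process with getrlimit. The following is a minimal sketch; the wrapper name ReadSoftLimit is hypothetical, while getrlimit, RLIMIT_NPROC and RLIMIT_STACK are the standard names from <sys/resource.h>.

```cpp
#include <sys/resource.h>

// Reads the soft limit for `resource` (for example RLIMIT_NPROC or
// RLIMIT_STACK). Returns false if getrlimit fails.
bool ReadSoftLimit(int resource, rlim_t* soft_limit) {
  struct rlimit limit;
  if (getrlimit(resource, &limit) != 0) {
    return false;
  }
  *soft_limit = limit.rlim_cur;
  return true;
}
```

A value equal to RLIM_INFINITY corresponds to "unlimited" in the listing. On the device described here, RLIMIT_NPROC would be expected to report 2831 and RLIMIT_STACK 8192KB expressed in bytes.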

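So the root cause is that the com.*.wallpaperplayer:provider process cloned too many threads and reached the system limit. On Linux each thread is a task created by clone and counts against the "processes" limit, so one process leaking threads can make clone fail with EAGAIN in other processes running under the same user ID.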

summary

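To summarize, a stack OOM in an Android Java process may have the following causes:

  1. A thread leak pushes the number of threads up to the system limit.
  2. Thread functions nest too deeply and use up the 1MB stack.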