Introduction
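I recently ran into an interesting problem: during stress testing on a customer platform, the launcher kept restarting. The error log was as follows: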
11-30 04:19:24.843 W/libc ( 4819): pthread_create failed: clone failed: Try again
11-30 04:19:24.844 E/art ( 4819): Throwing OutOfMemoryError "pthread_create (1040KB stack) failed: Try again"
11-30 04:19:24.850 E/AndroidRuntime( 4819): FATAL EXCEPTION: AsyncTask #2
11-30 04:19:24.850 E/AndroidRuntime( 4819): Process: com.****.launcher, PID: 4819
11-30 04:19:24.850 E/AndroidRuntime( 4819): java.lang.OutOfMemoryError: pthread_create (1040KB stack) failed: Try again
11-30 04:19:24.850 E/AndroidRuntime( 4819): at java.lang.Thread.nativeCreate(Native Method)
11-30 04:19:24.850 E/AndroidRuntime( 4819): at java.lang.Thread.start(Thread.java:1063)
11-30 04:19:24.850 E/AndroidRuntime( 4819): at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:920)
11-30 04:19:24.850 E/AndroidRuntime( 4819): at java.util.concurrent.ThreadPoolExecutor.processWorkerExit(ThreadPoolExecutor.java:988)
11-30 04:19:24.850 E/AndroidRuntime( 4819): at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
11-30 04:19:24.850 E/AndroidRuntime( 4819): at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:587)
11-30 04:19:24.850 E/AndroidRuntime( 4819): at java.lang.Thread.run(Thread.java:818)
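Was it really out of memory? The exception is an OutOfMemoryError, so the first instinct is a memory leak. However, the memory usage dumped during the stress test showed nothing abnormal.
Then I noticed that the errno is EAGAIN ("Try again"), not ENOMEM, and that the failure occurs in pthread_create. The ART code that reports the error is as follows: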
art/runtime/thread.cc
void Thread::CreateNativeThread(JNIEnv* env, jobject java_peer, size_t stack_size, bool is_daemon) {
  CHECK(java_peer != nullptr);
  Thread* self = static_cast<JNIEnvExt*>(env)->self;
  Runtime* runtime = Runtime::Current();
  // Atomically start the birth of the thread ensuring the runtime isn't shutting down.
  bool thread_start_during_shutdown = false;
  {
    MutexLock mu(self, *Locks::runtime_shutdown_lock_);
    if (runtime->IsShuttingDownLocked()) {
      thread_start_during_shutdown = true;
    } else {
      runtime->StartThreadBirth();
    }
  }
  if (thread_start_during_shutdown) {
    ScopedLocalRef<jclass> error_class(env, env->FindClass("java/lang/InternalError"));
    env->ThrowNew(error_class.get(), "Thread starting during runtime shutdown");
    return;
  }
  Thread* child_thread = new Thread(is_daemon);
  // Use global JNI ref to hold peer live while child thread starts.
  child_thread->tlsPtr_.jpeer = env->NewGlobalRef(java_peer);
  stack_size = FixStackSize(stack_size);
  // Thread.start is synchronized, so we know that nativePeer is 0, and know that we're not racing to
  // assign it.
  env->SetLongField(java_peer, WellKnownClasses::java_lang_Thread_nativePeer,
                    reinterpret_cast<jlong>(child_thread));
  pthread_t new_pthread;
  pthread_attr_t attr;
  CHECK_PTHREAD_CALL(pthread_attr_init, (&attr), "new thread");
  CHECK_PTHREAD_CALL(pthread_attr_setdetachstate, (&attr, PTHREAD_CREATE_DETACHED), "PTHREAD_CREATE_DETACHED");
  CHECK_PTHREAD_CALL(pthread_attr_setstacksize, (&attr, stack_size), stack_size);
  int pthread_create_result = pthread_create(&new_pthread, &attr, Thread::CreateCallback, child_thread);
  CHECK_PTHREAD_CALL(pthread_attr_destroy, (&attr), "new thread");
  if (pthread_create_result != 0) {
    // pthread_create(3) failed, so clean up.
    {
      MutexLock mu(self, *Locks::runtime_shutdown_lock_);
      runtime->EndThreadBirth();
    }
    // Manually delete the global reference since Thread::Init will not have been run.
    env->DeleteGlobalRef(child_thread->tlsPtr_.jpeer);
    child_thread->tlsPtr_.jpeer = nullptr;
    delete child_thread;
    child_thread = nullptr;
    // TODO: remove from thread group?
    env->SetLongField(java_peer, WellKnownClasses::java_lang_Thread_nativePeer, 0);
    {
      std::string msg(StringPrintf("pthread_create (%s stack) failed: %s",
                                   PrettySize(stack_size).c_str(), strerror(pthread_create_result)));
      ScopedObjectAccess soa(env);
      soa.Self()->ThrowOutOfMemoryError(msg.c_str());
    }
  }
}
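Both the call stack in the log and this code show that the error is raised when pthread_create fails to create a thread. So this is not a heap memory leak; the failure may instead be caused by a shortage of stack memory.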
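To make the EAGAIN-versus-ENOMEM distinction concrete, here is a minimal sketch that maps a pthread_create return value to a hint about where to look. ExplainPthreadCreateError is a hypothetical helper, not part of ART or bionic, and the wording of the hints is my own reading of the error codes POSIX documents for pthread_create.

```cpp
#include <cerrno>
#include <string>

// Hypothetical helper (not from ART): turns a pthread_create return code
// into a short hint. pthread_create returns the error number directly
// instead of setting errno.
std::string ExplainPthreadCreateError(int rc) {
  switch (rc) {
    case 0:
      return "ok";
    case EAGAIN:
      // Insufficient resources, or a limit on the number of threads was hit.
      return "resource or thread-count limit reached: check thread counts and limits";
    case EINVAL:
      return "invalid thread attributes";
    case EPERM:
      return "no permission for the requested scheduling policy or parameters";
    default:
      return "unexpected error code " + std::to_string(rc);
  }
}
```

With the failure in the log above, the return code is EAGAIN, so the hint points at thread counts and resource limits rather than at the Java heap.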
android java thread stack size
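So what is the default stack size of a Java thread on Android?
Dalvik kept the Java and native stacks separate: by default the Java stack was 32KB and the native stack was 1MB. In the usual case, the Java thread stack size on ART is the same as on Dalvik.
Stack space is allocated when a thread is created and reclaimed when the thread exits. A stack of 1MB or more is a very large space. Memory on the stack does not need GC, because it is released when the function returns.
The ART Java thread stack size can also be worked out from the code. In CreateNativeThread above, stack_size is computed by the function below. By default the size passed in is 0, which selects the default size.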
stack_size = FixStackSize(stack_size);
static size_t FixStackSize(size_t stack_size) {
  // A stack size of zero means "use the default".
  if (stack_size == 0) {
    stack_size = Runtime::Current()->GetDefaultStackSize();
  }
  // Dalvik used the bionic pthread default stack size for native threads,
  // so include that here to support apps that expect large native stacks.
  stack_size += 1 * MB;
  // It's not possible to request a stack smaller than the system-defined PTHREAD_STACK_MIN.
  if (stack_size < PTHREAD_STACK_MIN) {
    stack_size = PTHREAD_STACK_MIN;
  }
  if (Runtime::Current()->ExplicitStackOverflowChecks()) {
    // It's likely that callers are trying to ensure they have at least a certain amount of
    // stack space, so we should add our reserved space on top of what they requested, rather
    // than implicitly take it away from them.
    stack_size += GetStackOverflowReservedBytes(kRuntimeISA);
  } else {
    // If we are going to use implicit stack checks, allocate space for the protected
    // region at the bottom of the stack.
    stack_size += Thread::kStackOverflowImplicitCheckSize +
        GetStackOverflowReservedBytes(kRuntimeISA);
  }
  // Some systems require the stack size to be a multiple of the system page size, so round up.
  stack_size = RoundUp(stack_size, kPageSize);
  return stack_size;
}
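Runtime::Current()->GetDefaultStackSize() returns the default stack size, which is specified by an option when zygote forks the process. On top of that come an extra 8KB reserved to guard against stack overflow and the 1MB native stack, for a total of 1056KB (24KB + 8KB + 1024KB). Note that the log above reports a 1040KB stack, so the effective values on this device evidently differ slightly from this estimate.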
frameworks/base/core/jni/AndroidRuntime.cpp:667: addOption("-XX:mainThreadStackSize=24K");
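The arithmetic above can be reproduced with a simplified stand-in for FixStackSize. This is only a sketch: EstimateStackSize, RoundUpTo and their parameters are hypothetical names, the PTHREAD_STACK_MIN clamp is left out, and the guard size is passed in by the caller instead of being taken from the runtime.

```cpp
#include <cstddef>

const std::size_t KB = 1024;
const std::size_t MB = 1024 * KB;

// Rounds n up to the next multiple of align (align must be non-zero).
inline std::size_t RoundUpTo(std::size_t n, std::size_t align) {
  return (n + align - 1) / align * align;
}

// Simplified model of the stack size calculation. All runtime-dependent
// values are parameters, because this sketch does not know the real ones.
inline std::size_t EstimateStackSize(std::size_t requested,
                                     std::size_t default_size,
                                     std::size_t guard_bytes,
                                     std::size_t page_size) {
  // Zero means "use the default".
  std::size_t size = (requested == 0) ? default_size : requested;
  size += 1 * MB;       // room for the native stack
  size += guard_bytes;  // space reserved for stack overflow handling
  return RoundUpTo(size, page_size);
}
```

EstimateStackSize(0, 24 * KB, 8 * KB, 4096) gives 1056KB, matching the figure in the text. A default of 0 with 16KB of guard space would give the 1040KB seen in the log, but that combination is a guess on my part, not something the source confirms.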
how did it happen
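If 1MB is plenty of space, how could the stack run short as described above?
Thread stacks are independent of each other. In Java, the stack resources for a thread are allocated after start() is called: start() calls a native method that creates the thread and acquires the resources it needs, and then the thread's run() method is invoked. This makes it easy to see why running out of stack shows up at thread creation.
The key point is that the operating system's limit on the number of processes is usually fairly large, while the limit on stack memory is fairly small.
Note that this is not the process stack size. The process stack size can be checked with ulimit -s or ulimit -a, and is usually 8MB.
A possible cause of the problem above: the launcher process starts a thread pool with no upper bound, or with a very large one; under some conditions the threads in the pool do not exit after finishing their work, for example because of a deadlock; as a result the number of threads keeps growing until stack memory is used up.
However, in the subsequent stress tests I dumped the launcher process's memory and thread count. Its stack stayed at around 544KB and it had around 80 threads. Both are normal.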
** MEMINFO in pid 4802 [com.***.launcher] **
Pss Private Private Swapped Heap Heap Heap
Total Dirty Clean Dirty Size Alloc Free
------ ------ ------ ------ ------ ------ ------
Native Heap 6898 6864 0 0 8416 5550 2865
Dalvik Heap 27934 27856 0 0 37115 22421 14694
Dalvik Other 648 648 0 0
Stack 544 544 0 0
Other dev 17 0 16 0
.so mmap 4780 1468 2464 720
.apk mmap 456 0 240 0
.dex mmap 7914 0 7840 0
.oat mmap 1968 0 300 0
.art mmap 1935 1260 464 0
Other mmap 36 4 0 0
GL mtrack 12304 12304 0 0
Unknown 144 144 0 4
TOTAL 65578 51092 11324 724 45531 27971 17559
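So what caused pthread_create to fail?
I went through the log again and found that more than one process reported pthread_create failures. After the stress test had run for a while, many processes exited because clone failed. One process, wallpaperplayer, did not. I dumped its memory information and process status: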
Applications Memory Usage (kB):
Uptime: 62306777 Realtime: 62306777
** MEMINFO in pid 5266 [com.***.wallpaperplayer:provider] **
Pss Private Private Swapped Heap Heap Heap
Total Dirty Clean Dirty Size Alloc Free
------ ------ ------ ------ ------ ------ ------
Native Heap 87994 87972 0 1944 90732 81715 9016
Dalvik Heap 43098 43016 0 612 33056 27605 5451
Dalvik Other 2088 2088 0 24
Stack 8572 8572 0 24
Other dev 549 0 548 0
.so mmap 2845 200 2484 1240
.dex mmap 2221 0 1768 0
.oat mmap 506 0 36 0
.art mmap 809 636 0 104
Other mmap 67 56 0 0
Unknown 208 208 0 336
TOTAL 158957 152748 4836 4284 123788 109320 14467
Objects
Views: 0 ViewRootImpl: 0
AppContexts: 3 Activities: 0
Assets: 6 AssetManagers: 6
Local Binders: 6 Proxy Binders: 15
Parcel memory: 1153 Parcel count: 4614
Death Recipients: 3 OpenSSL Sockets: 0
SQL
MEMORY_USED: 0
PAGECACHE_OVERFLOW: 0 MALLOC_SIZE: 0
cat /proc/5266/status
Name: player:provider
State: S (sleeping)
Tgid: 5266
Pid: 5266
PPid: 3906
TracerPid: 0
Uid: 1000 1000 1000 1000
Gid: 1000 1000 1000 1000
Ngid: 0
FDSize: 128
Groups: 1007 1015 1023 1028 2001 3001 3002 3003 9997 41000
VmPeak: 4054120 kB
VmSize: 4049904 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 92168 kB
VmRSS: 23060 kB
VmData: 2640412 kB
VmStk: 8196 kB
VmExe: 12 kB
VmLib: 74428 kB
VmPTE: 5488 kB
VmSwap: 3444 kB
Threads: 2333
SigQ: 2/2831
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000001204
SigIgn: 0000000000000000
SigCgt: 00000002000094f8
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000000000000000
Seccomp: 0
Cpus_allowed: f
Cpus_allowed_list: 0-3
voluntary_ctxt_switches: 11872
nonvoluntary_ctxt_switches: 3423
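The Threads field in the status output above can also be read programmatically, which is handy for logging the thread count of every process during a stress run. The sketch below is an illustration only: ParseThreadCount and ReadThreadCount are hypothetical helper names, and the code assumes the "Threads:" line format shown above.

```cpp
#include <exception>
#include <fstream>
#include <sstream>
#include <string>

// Returns the value of the "Threads:" field in the text of a
// /proc/<pid>/status file, or -1 if the field is missing or malformed.
long ParseThreadCount(const std::string& status_text) {
  const std::string key = "Threads:";
  std::istringstream lines(status_text);
  std::string line;
  while (std::getline(lines, line)) {
    if (line.compare(0, key.size(), key) == 0) {
      try {
        // std::stol skips the whitespace between the key and the number.
        return std::stol(line.substr(key.size()));
      } catch (const std::exception&) {
        return -1;
      }
    }
  }
  return -1;
}

// Reads /proc/<pid>/status and returns the thread count, or -1 on failure.
long ReadThreadCount(int pid) {
  std::ifstream in("/proc/" + std::to_string(pid) + "/status");
  if (!in) {
    return -1;
  }
  std::stringstream buffer;
  buffer << in.rdbuf();
  return ParseThreadCount(buffer.str());
}
```

Calling ReadThreadCount for each pid at regular intervals would have shown one process growing steadily while the launcher stayed flat.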
From the information above, the wallpaperplayer process has 2333 threads and a stack of nearly 8MB. The system resource limits are as follows: 2831 processes and an 8MB stack.
time(cpu-seconds) unlimited
file(blocks) unlimited
coredump(blocks) 0
data(KiB) unlimited
stack(KiB) 8192
lockedmem(KiB) 64
nofiles(descriptors) 1024
processes 2831
sigpending 2831
msgqueue(bytes) 819200
maxnice 40
maxrtprio 0
resident-set(KiB) unlimited
address-space(KiB) unlimited
So the root cause is that the com.*.wallpaperplayer:provider process cloned so many threads that the system limit was reached.
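The "processes" and "stack" limits listed above can be queried from inside a process with getrlimit. The following is a minimal sketch for Linux; QueryLimit and FormatLimit are hypothetical helper names introduced here for illustration.

```cpp
#include <sys/resource.h>

#include <string>

// Formats a limit value the way ulimit prints it.
std::string FormatLimit(rlim_t value) {
  return value == RLIM_INFINITY ? "unlimited" : std::to_string(value);
}

// Fetches the soft and hard limit for one resource, for example
// RLIMIT_NPROC or RLIMIT_STACK. Returns false if the call fails.
bool QueryLimit(int resource, rlimit* out) {
  return getrlimit(resource, out) == 0;
}
```

For example, after QueryLimit(RLIMIT_NPROC, &lim) succeeds, FormatLimit(lim.rlim_cur) gives the current soft limit. On Linux, RLIMIT_NPROC counts threads as well as processes and applies per real user ID, which would explain why one leaking process can make clone fail in other processes that run under the same user.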
summary
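To summarize, a stack OOM in an Android Java process can have the following causes:
- a thread leak: the number of threads (processes) reaches the system limit
- the thread's function calls are nested too deeply and use up the 1MB stack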