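While launching child processes from a multithreaded program, one launch failed: after fork() the child neither went on to call exec nor exited, but hung in an intermediate state. The visible symptom was an extra process in the environment with the same name as the parent.
The child's stack trace was: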
#0 0x00007ff1d1ae6c83 in sys_futex (t=0x7ff1ac6fd1b0, v=2, o=128, a=0x7ff1d1cfdec0 <tcmalloc::Static::central_cache_+8512>) at ./src/base/linux_syscall_support.h:1787
#1 base::internal::SpinLockDelay (w=w@entry=0x7ff1d1cfdec0 <tcmalloc::Static::central_cache_+8512>, value=2, loop=loop@entry=6350) at ./src/base/spinlock_linux-inl.h:87
#2 0x00007ff1d1ae6ed7 in SpinLock::SlowLock (this=this@entry=0x7ff1d1cfdec0 <tcmalloc::Static::central_cache_+8512>) at src/base/spinlock.cc:132
#3 0x00007ff1d1ae0630 in Lock (this=0x7ff1d1cfdec0 <tcmalloc::Static::central_cache_+8512>) at src/base/spinlock.h:75
#4 tcmalloc::CentralFreeList::RemoveRange (this=0x7ff1d1cfdec0 <tcmalloc::Static::central_cache_+8512>, start=start@entry=0x7ff1ac6fd290, end=end@entry=0x7ff1ac6fd298, N=24) at src/central_freelist.cc:247
#5 0x00007ff1d1ae32f3 in tcmalloc::ThreadCache::FetchFromCentralCache (this=0x13badc0, cl=<optimized out>, byte_size=96) at src/thread_cache.cc:162
#6 0x00007ff1d1aecaf8 in Allocate (cl=<optimized out>, size=<optimized out>, this=<optimized out>) at src/thread_cache.h:341
#7 do_malloc (size=<optimized out>) at src/tcmalloc.cc:1068
#8 cpp_alloc (nothrow=false, size=88) at src/tcmalloc.cc:1354
#9 tc_newarray (size=88) at src/tcmalloc.cc:1560
#10 0x00007ff1d33d6840 in Process::launch(char const*, int, char**) ()
The child is blocked on a lock inside tcmalloc, the memory allocation library.
The relevant tcmalloc code is:
int CentralFreeList::RemoveRange(void **start, void **end, int N) {
  ASSERT(N > 0);
  lock_.Lock();
  if (N == Static::sizemap()->num_objects_to_move(size_class_) &&
      used_slots_ > 0) {
    int slot = --used_slots_;
    ASSERT(slot >= 0);
    TCEntry *entry = &tc_slots_[slot];
    *start = entry->head;
    *end = entry->tail;
    lock_.Unlock();
    return N;
  }
  // ... (excerpt ends here; the rest of the function is not shown)
}
Any heap allocation can end up on this path. So did we call new in the child after fork()? The code in question:
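Process Process::launch(const char* cmdline, int argc, char* argv[])
{
    Process p;
    int pid = fork();
    if (pid > 0) {
        // Parent: record the child's pid.
        p.pid = pid;
        return p;
    }
    else if (pid == 0) {
        // Child: this new goes through tcmalloc (frames #8 to #10 in the
        // stack trace above) and is where the child blocks.
        char** args = new char*[argc + 2];
        args[0] = (char*)cmdline;
        args[argc + 1] = NULL;
        for (int i(0); i < argc; ++i) {
            args[i + 1] = argv[i];
        }
        execvp(cmdline, args);
        // Only reached if execvp fails.
        delete[] args;
        exit(0);
    }
    // fork() failed (pid < 0).
    return p;
}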
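As a result, the child blocks inside operator new and never reaches execvp, which is the deadlock we observed. fork() copies only the calling thread into the child, so if another thread held the tcmalloc lock at the moment of the fork, the lock stays locked in the child and no thread exists there to release it.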
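The effect can be reproduced without tcmalloc. The following is a minimal sketch, assuming Linux with glibc; the names `g_lock`, `holder`, and `child_sees_lock_held` are made up for this illustration and an ordinary pthread mutex stands in for tcmalloc's internal spinlock. One thread takes the lock, the main thread forks, and the child uses a non-blocking trylock to report whether the lock it inherited is still held.

```cpp
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cerrno>

// Stand-in for an allocator-internal lock (hypothetical, not tcmalloc's).
static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
static int g_locked_pipe[2];   // holder -> main: "I hold the lock"
static int g_release_pipe[2];  // main -> holder: "you may unlock now"

static void* holder(void*) {
    char c = 'x';
    pthread_mutex_lock(&g_lock);
    if (write(g_locked_pipe[1], &c, 1) != 1) { /* ignored in this sketch */ }
    if (read(g_release_pipe[0], &c, 1) != 1) { /* ignored in this sketch */ }
    pthread_mutex_unlock(&g_lock);
    return nullptr;
}

// Returns true if the forked child found the lock already held.
bool child_sees_lock_held() {
    if (pipe(g_locked_pipe) != 0 || pipe(g_release_pipe) != 0) return false;
    pthread_t t;
    if (pthread_create(&t, nullptr, holder, nullptr) != 0) return false;

    // Wait until the holder thread really owns the lock.
    char c = 0;
    if (read(g_locked_pipe[0], &c, 1) != 1) { /* ignored in this sketch */ }

    pid_t pid = fork();
    if (pid == 0) {
        // Child: only this thread exists here. The holder thread was not
        // copied, but the locked state of g_lock was.
        _exit(pthread_mutex_trylock(&g_lock) == EBUSY ? 1 : 0);
    }

    bool held = false;
    int status = 0;
    if (pid > 0 && waitpid(pid, &status, 0) == pid) {
        held = WIFEXITED(status) && WEXITSTATUS(status) == 1;
    }

    // Let the holder thread unlock and finish, then clean up.
    if (write(g_release_pipe[1], &c, 1) != 1) { /* ignored in this sketch */ }
    pthread_join(t, nullptr);
    close(g_locked_pipe[0]);
    close(g_locked_pipe[1]);
    close(g_release_pipe[0]);
    close(g_release_pipe[1]);
    return held;
}
```

A blocking `pthread_mutex_lock` in the child would wait forever at this point, which is the same situation as the child stuck in `SpinLock::SlowLock` in the stack trace.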
A search of the official tcmalloc issue tracker shows that the same problem has been reported:
Originally reported on Google Code with ID 496
What steps will reproduce the problem?
Use tcmalloc in an environment where threads might call fork. The testcase
attached (test-threadfork.c) is a small example that creates a set of threads and each
thread allocates some memory, fork a allocates more memory.
Run the testcase with a higher number of threads and forks to trigger the issue.
What is the expected output? What do you see instead?
The expect output is to no deadlock occurs in the fork and all children process eventually
finish. The tcmalloc contains a bug that some internal locks are left in a undefined
state between fork, leaving the child process in a deadlock state.
What version of the product are you using? On what operating system?
I tested svn version r190 in a PPC64 and X86_64 Linux environment.
Please provide any additional information below.
The issue is the locks defined at src/static_vars.h, Static::pageheap_lock_ and each
lock from Static::CentralFreeListPadded elements, needs to be in a consistent state
in a forked version of a thread. Currently, some race issues might occurs if the following
scenario occurs:
Thread 1                                   | Thread 2
calls malloc()                             |
 \_ tcmalloc lock Static::pageheap_lock_   |
                                           | calls fork()
                                           | calls malloc()
                                           |  \_ tcmalloc tries to lock the same lock
The same might occur with any lock from Static::central_cache_ elements as well.
A possible solution, presented in patch gperftools-atfork.patch, is register 2 functions
with pthread_atfork to lock all the locks in the parent just prior the fork() call
and to unlock all the locks after the fork() call on both the parent and child. This
patch fixes the above behavior with the testcase.
I didn't on any other platform, so we might need to add guards on non-unix platforms.
I'm accepting suggestions.
In the end we fixed the fault by moving the new allocation ahead of fork(). The modified code:
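Process Process::launch(const char* cmdline, int argc, char* argv[])
{
    Process p;
    // Allocate before fork() so the child never touches the allocator.
    // Plain new throws std::bad_alloc on failure, so no null check is needed.
    char** args = new char*[argc + 2];
    args[0] = (char*)cmdline;
    args[argc + 1] = NULL;
    for (int i = 0; i < argc; ++i) {
        args[i + 1] = argv[i];
    }
    int pid = fork();
    if (pid > 0) {
        // Parent: record the child's pid and free the argument array.
        p.pid = pid;
        delete[] args;
        return p;
    }
    else if (pid == 0) {
        execvp(cmdline, args);
        // Only reached if execvp fails. Do not delete[] here: freeing also
        // goes through tcmalloc and could block on the same lock. _exit()
        // skips atexit handlers and static destructors, and a non-zero
        // status reports the failure to the parent.
        _exit(127);
    }
    // fork() failed (pid < 0).
    delete[] args;
    return p;
}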
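The same idea can be written with a std::vector so that the parent needs no manual delete[]. This is only a sketch under assumptions: `run_and_wait` is a hypothetical free function, not part of the Process class above, and it waits for the child instead of returning a handle so that the result is easy to check.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cerrno>
#include <string>
#include <vector>

// Hypothetical helper: runs `file` with `extra_args` and returns the child's
// exit status, or -1 if fork/waitpid failed or the child ended abnormally.
int run_and_wait(const std::string& file,
                 const std::vector<std::string>& extra_args) {
    // Build argv before fork(): every heap allocation happens here, in the
    // parent, while all of its threads still exist.
    std::vector<char*> argv;
    argv.reserve(extra_args.size() + 2);
    argv.push_back(const_cast<char*>(file.c_str()));
    for (const std::string& arg : extra_args) {
        argv.push_back(const_cast<char*>(arg.c_str()));
    }
    argv.push_back(nullptr);

    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        // Child: no new, no delete, no container operations; just exec.
        execvp(file.c_str(), argv.data());
        _exit(127);  // only reached if execvp failed
    }

    // Parent: wait for the child, retrying if interrupted by a signal.
    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) {
            return -1;
        }
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, `run_and_wait("sh", {"-c", "exit 7"})` runs the shell and hands its exit status back to the caller. The vector is destroyed in the parent when the function returns, while the child either replaces its image with exec or leaves through `_exit` without running destructors.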
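The official Google discussion also offers a few solutions:
1. Allocate the memory in advance, then call fork(), and in the child call execvp directly. This avoids the problem.
2. Use pthread_atfork. This requires knowing in advance which locks the child has to release, so that handlers which take and release them can be passed in as arguments.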
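The pthread_atfork approach can be sketched as follows. This is not the gperftools patch: it assumes a single application-level mutex (`g_lock`, a hypothetical name) instead of tcmalloc's internal locks, and it assumes Linux with glibc, where a normal mutex locked by the forking thread can be unlocked by the child. The prepare handler takes the lock just before fork(), so no other thread can hold it at the moment of the fork, and the parent and child handlers release it afterwards.

```cpp
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <atomic>

static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_once_t g_once = PTHREAD_ONCE_INIT;
static std::atomic<bool> g_stop{false};

// prepare handler: runs in the forking thread just before fork().
static void before_fork() { pthread_mutex_lock(&g_lock); }
// parent and child handlers: run after fork() in each process.
static void after_fork() { pthread_mutex_unlock(&g_lock); }

static void register_handlers() {
    pthread_atfork(before_fork, after_fork, after_fork);
}

// Background thread that keeps taking and releasing the lock.
static void* worker(void*) {
    while (!g_stop.load()) {
        pthread_mutex_lock(&g_lock);
        pthread_mutex_unlock(&g_lock);
        usleep(100);
    }
    return nullptr;
}

// Returns true if the forked child was able to take the lock.
bool child_can_lock_after_fork() {
    // Register once only: handlers accumulate, and a second registration
    // would lock the same non-recursive mutex twice in the prepare phase.
    pthread_once(&g_once, register_handlers);

    g_stop.store(false);
    pthread_t t;
    if (pthread_create(&t, nullptr, worker, nullptr) != 0) return false;
    usleep(1000);  // give the worker time to start contending

    pid_t pid = fork();
    if (pid == 0) {
        // Child: the child handler has already released the lock.
        _exit(pthread_mutex_trylock(&g_lock) == 0 ? 0 : 1);
    }

    int status = -1;
    bool ok = pid > 0 && waitpid(pid, &status, 0) == pid &&
              WIFEXITED(status) && WEXITSTATUS(status) == 0;
    g_stop.store(true);
    pthread_join(t, nullptr);
    return ok;
}
```

The cost is that fork() now waits for whichever thread currently holds the lock, and the handlers must cover every lock the child may need. That requirement is why this option depends on knowing the locks in advance.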