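Recently I worked on a project at my company in which the server program was designed as a multithreaded application to improve its execution efficiency. During one test the server stopped responding entirely. My first guess was that the server had deadlocked, so I used the pstack command to inspect the stack of every thread.
# pstack <pid>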
Thread 9 (Thread 0x7fe82b43a700 (LWP 29656)):
#0 0x00007fe82f3681bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fe82f363d1d in _L_lock_840 () from /lib64/libpthread.so.0
#2 0x00007fe82f363c3a in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fe8301f671d in _dl_fini () from /lib64/ld-linux-x86-64.so.2
#4 0x00007fe82d6e1a49 in __run_exit_handlers () from /lib64/libc.so.6
#5 0x00007fe82d6e1a95 in exit () from /lib64/libc.so.6
#6 0x00000000004048c3 in printRecvSignalNum (sign=<optimized out>) at AC.c:257
#7 <signal handler called>
#8 0x00007fe82f3681bb in __lll_lock_wait () from /lib64/libpthread.so.0
#9 0x00007fe82f363d02 in _L_lock_791 () from /lib64/libpthread.so.0
#10 0x00007fe82f363c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#11 0x000000000041f802 in cronometer (arg=<optimized out>) at timerlib.c:266
#12 0x00007fe82f361dc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007fe82d7a073d in clone () from /lib64/libc.so.6
Thread 8 (Thread 0x7fe82ac39700 (LWP 29671)):
#0 0x00007fe82d7ae0fc in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x00007fe82d72bf93 in _L_lock_14932 () from /lib64/libc.so.6
#2 0x00007fe82d729013 in malloc () from /lib64/libc.so.6
#3 0x00007fe8301f4078 in _dl_map_object_deps () from /lib64/ld-linux-x86-64.so.2
#4 0x00007fe8301fa6db in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#5 0x00007fe8301f5ff4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#6 0x00007fe8301f9feb in _dl_open () from /lib64/ld-linux-x86-64.so.2
#7 0x00007fe82d7dafc2 in do_dlopen () from /lib64/libc.so.6
#8 0x00007fe8301f5ff4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#9 0x00007fe82d7db082 in __libc_dlopen_mode () from /lib64/libc.so.6
#10 0x00007fe82d7b4565 in init () from /lib64/libc.so.6
#11 0x00007fe82f366bb0 in pthread_once () from /lib64/libpthread.so.0
#12 0x00007fe82d7b467c in backtrace () from /lib64/libc.so.6
#13 0x00007fe82eb4ead9 in procAssertStackInfo () at cc_common.c:545
#14 0x00007fe82eb4f240 in procAssertEntry (file=0x0, func=0x0, line=0, exp_str=0x0, sign=11) at cc_common.c:597
#15 <signal handler called>
#16 0x00007fe82d72477d in malloc_consolidate () from /lib64/libc.so.6
#17 0x00007fe82d726385 in _int_malloc () from /lib64/libc.so.6
#18 0x00007fe82d729a14 in calloc () from /lib64/libc.so.6
#19 0x0000000000432287 in UpdateStasInfoIntoMySQL (listStas=listStas@entry=0x7fe82ac38b68, oldListStas=oldListStas@entry=0x7fe82ac37f90, pWtpHashNode=pWtpHashNode@entry=0x7fe81c01af10) at ACDisplay.c:3256
#20 0x0000000000432bc4 in UpdateStationListMySQL (listStas=0x7fe82ac38b68, listStas@entry=0x0, pWtpHashNode=0x7fe81c01af10, pWtpHashNode@entry=0x7fe82ac38ab8) at ACDisplay.c:3585
#21 0x0000000000434607 in UpdateStationList (listStas=0x0, listStas@entry=0x7fe82ac38b68, pWtpHashNode=0x7fe82ac38ab8, pWtpHashNode@entry=0x7fe81c01af10) at ACDisplay.c:4716
#22 0x000000000040c4c3 in ACEnterRun (pWtpHashNode=pWtpHashNode@entry=0x7fe81c01af10, msgPtr=msgPtr@entry=0x7fe82ac38d70, dataFlag=CW_FALSE) at ACRunState.c:499
#23 0x00000000004061c9 in CWManageWTP (arg=arg@entry=0x7fe82ac38da8) at ACMainLoop.c:428
#24 0x00000000004069c1 in CWHandleIncomingCapwapPkg (parg=0xe7ab60) at ACMainLoop.c:497
#25 0x000000000040e27f in CWConsumerThread (arg=<optimized out>) at Scheduler.c:234
#26 0x00007fe82f361dc5 in start_thread () from /lib64/libpthread.so.0
#27 0x00007fe82d7a073d in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x7fe82a438700 (LWP 29672)):
#0 0x00007fe82f3681bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fe82f363d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2 0x00007fe82f363c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x000000000041a5d5 in CWThreadMutexLock (theMutex=theMutex@entry=0xc5f228 <g_wtp_data_hash+5594600>) at CWThread.c:157
#4 0x000000000040697e in CWHandleIncomingCapwapPkg (parg=0xe7d500) at ACMainLoop.c:483
#5 0x000000000040e27f in CWConsumerThread (arg=<optimized out>) at Scheduler.c:234
#6 0x00007fe82f361dc5 in start_thread () from /lib64/libpthread.so.0
#7 0x00007fe82d7a073d in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7fe829c37700 (LWP 29673)):
#0 0x00007fe82d7ae0fc in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x00007fe82d72b991 in _L_lock_4780 () from /lib64/libc.so.6
#2 0x00007fe82d7251f8 in _int_free () from /lib64/libc.so.6
#3 0x000000000041fc65 in timer_rem (id=8522, free_arg=0x41a367 <CWTimerFreeArgSingleThread>) at timerlib.c:524
#4 0x000000000041adf8 in CWTimerCancelSingleThread (idPtr=<optimized out>) at CWThread.c:909
#5 0x000000000040be4f in CWStopNeighborDeadTimer (pWtpManData=<optimized out>) at ACRunState.c:1920
#6 0x000000000040be91 in CWRestartNeighborDeadTimer (pWtpManData=0x7fe814065720) at ACRunState.c:1935
#7 0x000000000040c08e in ACEnterRun (pWtpHashNode=pWtpHashNode@entry=0x7fe814064be0, msgPtr=msgPtr@entry=0x7fe829c36d70, dataFlag=CW_FALSE) at ACRunState.c:259
#8 0x00000000004061c9 in CWManageWTP (arg=arg@entry=0x7fe829c36da8) at ACMainLoop.c:428
#9 0x00000000004069c1 in CWHandleIncomingCapwapPkg (parg=0xe7d460) at ACMainLoop.c:497
#10 0x000000000040e27f in CWConsumerThread (arg=<optimized out>) at Scheduler.c:234
#11 0x00007fe82f361dc5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007fe82d7a073d in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7fe829436700 (LWP 29674)):
#0 0x00007fe82f3681bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fe82f363d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2 0x00007fe82f363c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x000000000041a5d5 in CWThreadMutexLock (theMutex=theMutex@entry=0xc5f228 <g_wtp_data_hash+5594600>) at CWThread.c:157
#4 0x000000000040697e in CWHandleIncomingCapwapPkg (parg=0xe7d2a0) at ACMainLoop.c:483
#5 0x000000000040e27f in CWConsumerThread (arg=<optimized out>) at Scheduler.c:234
#6 0x00007fe82f361dc5 in start_thread () from /lib64/libpthread.so.0
#7 0x00007fe82d7a073d in clone () from /lib64/libc.so.6
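The pstack output shows that threads 9 through 5 are all blocked waiting for locks. A closer look at the state of each thread shows that the real source of the deadlock is thread 8.
Thread 8:
At one point the thread called calloc to allocate heap memory.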
#18 0x00007fe82d729a14 in calloc () from /lib64/libc.so.6
Before that allocation had finished (frame #16, malloc_consolidate), the thread received signal 11 (SIGSEGV):
#14 0x00007fe82eb4f240 in procAssertEntry (file=0x0, func=0x0, line=0, exp_str=0x0, sign=11) at cc_common.c:597
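Our project installs its own handler for signal 11, and that handler calls backtrace. While it runs, backtrace ends up calling malloc (in this trace through the dynamic loader, frames #3 to #12):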
#2 0x00007fe82d729013 in malloc () from /lib64/libc.so.6
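This gives us the cause of the deadlock. The allocation was interrupted by signal 11 while it held the allocator's heap lock, the handler ran backtrace, and backtrace called malloc, which tried to take the same lock a second time. The lock is not recursive, so thread 8 blocks forever on a lock it already holds. From then on no other thread can acquire that heap lock either, so other threads block in operations such as malloc or free. Let's check whether any thread is blocked in such an operation: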
Thread 6 (Thread 0x7fe829c37700 (LWP 29673)):
#0 0x00007fe82d7ae0fc in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x00007fe82d72b991 in _L_lock_4780 () from /lib64/libc.so.6
#2 0x00007fe82d7251f8 in _int_free () from /lib64/libc.so.6
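Thread 6 is indeed blocked in free. Going back to our code, thread 6 was already holding a lock defined by our own project, and now it will never get the chance to release it. The remaining threads, 5, 7 and 9, are all waiting for locks that will never become available.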
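The double acquisition in thread 8 can be illustrated with a small sketch. It is only an analogy: glibc's allocator uses its own internal lock rather than a pthread mutex, and the helper name `relock_same_thread` is hypothetical. An error-checking mutex stands in for the heap lock so that the second lock attempt reports the problem instead of hanging the way thread 8 did.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper: the same thread locks one mutex twice and the
 * result of the second attempt is returned.
 * PTHREAD_MUTEX_ERRORCHECK makes the second attempt fail with EDEADLK;
 * a plain non-recursive lock would simply block forever here. */
int relock_same_thread(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t lock;
    int rc;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&lock, &attr);
    pthread_mutexattr_destroy(&attr);

    /* First acquisition: plays the role of calloc taking the heap lock. */
    pthread_mutex_lock(&lock);

    /* Second acquisition by the same thread: plays the role of malloc
     * being called from the signal handler that interrupted calloc. */
    rc = pthread_mutex_lock(&lock);

    pthread_mutex_unlock(&lock);
    pthread_mutex_destroy(&lock);
    return rc;
}
```

Calling `relock_same_thread()` returns `EDEADLK`, which is the condition thread 8 ran into silently.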
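So how do we fix this?
As the analysis shows, signal 11 was the trigger for the deadlock. Signal 11 is usually raised by an invalid memory access such as an out-of-bounds read or write, so we reviewed the recently written code and fixed that bug. The risk of the server deadlocking is still there, however. To eliminate that risk at the root, backtrace must not be called from a signal handler.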
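As a rough sketch of what a safer handler could look like, the code below restricts itself to `write` and `_exit`, both of which POSIX lists as async-signal-safe. The names `on_sigsegv`, `install_sigsegv_handler` and `demo_child_exit_code`, the message text, and the exit code `128 + SIGSEGV` are assumptions made for illustration and are not taken from the project.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical handler: no malloc, no stdio, no backtrace.
 * It only calls write() and _exit(), which are async-signal-safe. */
static void on_sigsegv(int sig)
{
    static const char msg[] = "fatal: SIGSEGV received\n";
    ssize_t n;

    (void)sig;
    n = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)n;
    _exit(128 + SIGSEGV);   /* exit code chosen for this sketch */
}

/* Installs the handler; returns 0 on success, -1 on failure. */
int install_sigsegv_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigsegv;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}

/* Demo: a child process installs the handler and raises SIGSEGV.
 * Returns the child's exit code, or -1 if something went wrong. */
int demo_child_exit_code(void)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (install_sigsegv_handler() != 0)
            _exit(1);
        raise(SIGSEGV);     /* the handler runs and calls _exit() */
        _exit(2);           /* not reached if the handler ran */
    }
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

The child terminates through the handler instead of blocking on a lock, so `demo_child_exit_code()` returns `128 + SIGSEGV`. If a stack trace is still needed, one option is to leave it to a core dump or an external tool rather than collecting it inside the handler.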
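Summary: a signal handler must be reentrant, which means it may call only reentrant (async-signal-safe) functions. The definitions of reentrant and non-reentrant functions are given below.
Reentrant function: a function that can be entered again while an earlier call is still in progress. It can be called concurrently and can be interrupted. It uses only data on its own stack and does not depend on the task environment, so it is safe under multitasking and there is no risk of corrupted data.
Non-reentrant function: a function that must not be called concurrently or re-entered, otherwise the result is unpredictable. Such functions typically use static data structures, call malloc() or free(), or use standard I/O functions.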
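The difference between the two definitions can be shown with a minimal sketch. The functions `id_to_str_static` and `id_to_str_r` are hypothetical and exist only for this illustration: the first keeps its result in a single static buffer, while the second writes into storage supplied by the caller.

```c
#include <stddef.h>

/* Writes the decimal digits of v into buf (at most len - 1 characters
 * plus a terminating NUL) and returns buf. Uses only local data. */
static char *format_unsigned(unsigned v, char *buf, size_t len)
{
    char tmp[16];
    size_t n = 0, i = 0;

    if (len == 0)
        return buf;
    do {
        tmp[n++] = (char)('0' + v % 10);   /* digits, least significant first */
        v /= 10;
    } while (v != 0 && n < sizeof(tmp));
    while (n > 0 && i + 1 < len)
        buf[i++] = tmp[--n];               /* reverse into the output */
    buf[i] = '\0';
    return buf;
}

/* Non-reentrant: every call shares one static buffer, so a second call
 * (from another thread or from a signal handler) overwrites the result
 * of the first. */
const char *id_to_str_static(unsigned id)
{
    static char buf[16];
    return format_unsigned(id, buf, sizeof(buf));
}

/* Reentrant: the caller owns the storage, so calls do not interfere. */
const char *id_to_str_r(unsigned id, char *buf, size_t len)
{
    return format_unsigned(id, buf, len);
}
```

With the static version, `id_to_str_static(1)` followed by `id_to_str_static(2)` returns the same pointer both times and the first result now reads "2". With `id_to_str_r` and two separate buffers, both results stay intact.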