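1. Problem
While testing a program, it suddenly stopped responding. I ran pstack against the process and saw that the call stacks barely changed between samples, and that several threads were parked in pthread_mutex_lock, so I suspected a deadlock.
2. Locating the cause
First, attach gdb to the process with gdb attach <pid> and list its threads.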
gdb attach 25659
(gdb) i threads
Id Target Id Frame
18 Thread 0x7fd55f070700 (LWP 25538) "spc" 0x00007fd55fecd6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
17 Thread 0x7fd55e86f700 (LWP 25539) "spc" 0x00007fd55fecda82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
16 Thread 0x7fd55e06e700 (LWP 25540) "spc" 0x00007fd55fecda82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
15 Thread 0x7fd55caca700 (LWP 25547) "spc" 0x00007fd55fecda82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
14 Thread 0x7fd556b0f700 (LWP 25548) "spc" 0x00007fd55fecd6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
13 Thread 0x7fd55630e700 (LWP 25549) "spc" 0x00007fd55fed01bd in __lll_lock_wait () from /lib64/libpthread.so.0
12 Thread 0x7fd555b0d700 (LWP 25550) "spc" 0x00007fd55fed01bd in __lll_lock_wait () from /lib64/libpthread.so.0
11 Thread 0x7fd55530c700 (LWP 25551) "spc" 0x00007fd55fed01bd in __lll_lock_wait () from /lib64/libpthread.so.0
10 Thread 0x7fd554b0b700 (LWP 25556) "msgSchedule" 0x00007fd55fecda82 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
from /lib64/libpthread.so.0
9 Thread 0x7fd54ffff700 (LWP 25557) "spc" 0x00007fd55fbf8d13 in epoll_wait () from /lib64/libc.so.6
8 Thread 0x7fd54f7fe700 (LWP 25558) "spc" 0x00007fd55fbf8977 in epoll_pwait () from /lib64/libc.so.6
7 Thread 0x7fd54effd700 (LWP 25572) "msgSchedule" 0x00007fd55fed01bd in __lll_lock_wait () from /lib64/libpthread.so.0
6 Thread 0x7fd54e7fc700 (LWP 25573) "msgSchedule" 0x00007fd55fed01bd in __lll_lock_wait () from /lib64/libpthread.so.0
5 Thread 0x7fd54dffb700 (LWP 25574) "msgSchedule" 0x00007fd55fed01bd in __lll_lock_wait () from /lib64/libpthread.so.0
4 Thread 0x7fd54d7fa700 (LWP 25575) "msgSchedule" 0x00007fd55fed01bd in __lll_lock_wait () from /lib64/libpthread.so.0
3 Thread 0x7fd54cff9700 (LWP 25576) "msgSchedule" 0x00007fd55fed01bd in __lll_lock_wait () from /lib64/libpthread.so.0
2 Thread 0x7fd51ffff700 (LWP 25577) "msgSchedule" 0x00007fd55fed01bd in __lll_lock_wait () from /lib64/libpthread.so.0
* 1 Thread 0x7fd562f04740 (LWP 25537) "spc" 0x00007fd55fed1101 in sigwait () from /lib64/libpthread.so
We can see that threads 2-7 and 11-13 are all blocked waiting for a lock in __lll_lock_wait().
(gdb) thread 2
[Switching to thread 2 (Thread 0x7fd51ffff700 (LWP 25577))]
#0 0x00007fd55fed01bd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0 0x00007fd55fed01bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fd55fecbd02 in _L_lock_791 () from /lib64/libpthread.so.0
#2 0x00007fd55fecbc08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000000000481741 in __gthread_mutex_lock (__mutex=0x76a740 <sgw::TcpServerHandler::m_ssHandlerMutex>)
at /usr/include/c++/4.8.2/x86_64-redhat-linux/bits/gthr-default.h:748
#4 0x0000000000488c34 in std::mutex::lock (this=0x76a740 <sgw::TcpServerHandler::m_ssHandlerMutex>) at /usr/include/c++/4.8.2/mutex:134
#5 0x000000000048978c in std::lock_guard<std::mutex>::lock_guard (this=0x7fd51fffd590, __m=...) at /usr/include/c++/4.8.2/mutex:414
#6 0x00000000004825c9 in sgw::TcpServerHandler::setMessage (ss=..., buffer=0x7fd51fffdb10, length=216) at RNMServerAdapter.cpp:121
#7 0x000000000049abe4 in sgw::CMIController::messageReply (this=0x7fd538002658, type=sgw::RNMServerAdapter::REP_QUERY_VOIP_INFO,
replyInfo="{\"errCode\":0, \"voipInfo\":{\"priImsi\":\"460000960345390\",\"priMsisdn\":\"8613530950520\",\"voipMsisdn\":\"852580502155159\",\"activeFlag\":\"0\",\"vlrid\":\"\"}}") at CMIController.cpp:127
#8 0x00000000004ad684 in sgw::CMIController::queryVoipInfo (this=0x7fd538002658) at CMIController.cpp:930
#9 0x00000000004b2e2f in sgw::CMIController::buinessProcess (this=0x7fd538002658) at CMIController.cpp:1216
#10 0x00000000004b1992 in sgw::CMIController::run (this=0x7fd538002658) at CMIController.cpp:1126
#11 0x00007fd5623fc2ab in Poco::PooledThread::run (this=0x7fd538003680) at src/ThreadPool.cpp:199
#12 0x00007fd5623f97fb in Poco::(anonymous namespace)::RunnableHolder::run (this=0x7fd538003410) at src/Thread.cpp:56
#13 0x00007fd5623f94cb in Poco::ThreadImpl::runnableEntry (pThread=0x7fd5380036a8) at src/Thread_POSIX.cpp:345
#14 0x00007fd55fec9dc5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007fd55fbf873d in clone () from /lib64/libc.so.6
(gdb) f 4
#4 0x0000000000488c34 in std::mutex::lock (this=0x76a740 <sgw::TcpServerHandler::m_ssHandlerMutex>) at /usr/include/c++/4.8.2/mutex:134
134 int __e = __gthread_mutex_lock(&_M_mutex);
(gdb) p _M_mutex
$1 = {__data = {__lock = 2, __count = 0, __owner = 25551, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0,
__next = 0x0}}, __size = "\002\000\000\000\000\000\000\000\317c\000\000\001", '\000' <repeats 26 times>, __align = 2}
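Pick any one of the waiting threads, here thread 2, and inspect its stack and the state of the mutex it is waiting on. The field __owner = 25551 tells us the mutex is currently held by the thread whose LWP is 25551. From the thread list above, LWP 25551 is thread 11, so we look at thread 11 next.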
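The same check that gdb performs with p _M_mutex can be sketched in code. This is only an illustration under assumptions: it relies on libstdc++ exposing the underlying pthread_mutex_t through native_handle(), and on the glibc-internal field __data.__owner, which is not a public API and is not recorded when lock elision is enabled. The helper names currentLwp, mutexOwnerLwp and ownerMatchesCaller are made up for this example.

```cpp
#include <cassert>
#include <mutex>
#include <pthread.h>
#include <sys/syscall.h>
#include <unistd.h>

// Kernel thread ID (the "LWP" number gdb prints) of the calling thread.
long currentLwp() {
    return syscall(SYS_gettid);
}

// Reads the owner recorded inside the mutex.
// ASSUMPTION: glibc layout; __data.__owner is an internal field, the same one
// gdb shows as "__owner" when printing _M_mutex.
long mutexOwnerLwp(std::mutex& m) {
    return m.native_handle()->__data.__owner;
}

// Locks a fresh mutex and checks that the recorded owner is this thread.
bool ownerMatchesCaller() {
    std::mutex m;
    std::lock_guard<std::mutex> guard(m);
    return mutexOwnerLwp(m) == currentLwp();
}
```

Build with -pthread. In a debugging session the useful comparison is the one made above: match __owner against the LWP column of info threads to find the thread that holds the lock.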
(gdb) thread 11
[Switching to thread 11 (Thread 0x7fd55530c700 (LWP 25551))]
#0 0x00007fd55fed01bd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0 0x00007fd55fed01bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fd55fecbd02 in _L_lock_791 () from /lib64/libpthread.so.0
#2 0x00007fd55fecbc08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000000000481741 in __gthread_mutex_lock (__mutex=0x76a740 <sgw::TcpServerHandler::m_ssHandlerMutex>)
at /usr/include/c++/4.8.2/x86_64-redhat-linux/bits/gthr-default.h:748
#4 0x0000000000488c34 in std::mutex::lock (this=0x76a740 <sgw::TcpServerHandler::m_ssHandlerMutex>) at /usr/include/c++/4.8.2/mutex:134
#5 0x000000000048978c in std::lock_guard<std::mutex>::lock_guard (this=0x7fd55530ba90, __m=...) at /usr/include/c++/4.8.2/mutex:414
#6 0x00000000004823c2 in sgw::TcpServerHandler::~TcpServerHandler (this=0x7fd5440009a0, __in_chrg=<optimized out>)
at RNMServerAdapter.cpp:102
#7 0x0000000000482905 in sgw::TcpServerHandler::onSocketWritable (this=0x7fd5440009a0, pNf=...) at RNMServerAdapter.cpp:147
#8 0x0000000000498362 in Poco::NObserver<sgw::TcpServerHandler, Poco::Net::WritableNotification>::notify (this=0x7fd5280d9a30, pNf=
0x1130270) at /usr/local/include/Poco/NObserver.h:86
#9 0x00007fd5623a7691 in Poco::NotificationCenter::postNotification (this=0x7fd544003520, pNotification=...)
at src/NotificationCenter.cpp:76
#10 0x00007fd561c30c3d in Poco::Net::SocketNotifier::dispatch (this=0x7fd5440034e0, pNotification=0x1130270) at src/SocketNotifier.cpp:80
#11 0x00007fd561c2ce46 in Poco::Net::SocketReactor::dispatch (this=0x1130c50, pNotifier=..., pNotification=0x1130270)
at src/SocketReactor.cpp:267
#12 0x00007fd561c2cc44 in Poco::Net::SocketReactor::dispatch (this=0x1130c50, socket=..., pNotification=0x1130270)
at src/SocketReactor.cpp:243
#13 0x00007fd561c2c2ba in Poco::Net::SocketReactor::run (this=0x1130c50) at src/SocketReactor.cpp:92
#14 0x00007fd5623f97fb in Poco::(anonymous namespace)::RunnableHolder::run (this=0x11263d0) at src/Thread.cpp:56
#15 0x00007fd5623f94cb in Poco::ThreadImpl::runnableEntry (pThread=0x10f8af8) at src/Thread_POSIX.cpp:345
#16 0x00007fd55fec9dc5 in start_thread () from /lib64/libpthread.so.0
#17 0x00007fd55fbf873d in clone () from /lib64/libc.so.6
(gdb) f 4
#4 0x0000000000488c34 in std::mutex::lock (this=0x76a740 <sgw::TcpServerHandler::m_ssHandlerMutex>) at /usr/include/c++/4.8.2/mutex:134
134 int __e = __gthread_mutex_lock(&_M_mutex);
(gdb) p _M_mutex
$1 = {__data = {__lock = 2, __count = 0, __owner = 25551, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0,
__next = 0x0}}, __size = "\002\000\000\000\000\000\000\000\317c\000\000\001", '\000' <repeats 26 times>, __align = 2}
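Thread 11 is also waiting for the lock, yet __owner shows that it already holds that same lock. This suggests that the program locks the mutex twice in a row on the same thread, which led to the following code as the location of the deadlock.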
std::lock_guard<std::mutex> autoLock(m_ssHandlerMutex);
try {
    _socket.sendBytes(m_outputBuffer);
} catch (Poco::Exception& e) {
    poco_error(*m_pLogger, e.displayText());
    if (e.code() == EAGAIN || e.code() == EWOULDBLOCK) return;
    // This is what causes the deadlock: `delete this` first runs the destructor
    // and then calls operator delete to free the memory.
    // autoLock is still held here, so the destructor blocks on the same mutex
    // and the thread deadlocks against itself.
    delete this;
    return;
}
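The effect of locking a mutex that the calling thread already holds can be observed without hanging the process. The sketch below is not the code from this program: it uses a POSIX error-checking mutex (PTHREAD_MUTEX_ERRORCHECK), for which a second lock by the owner returns EDEADLK instead of blocking. A default std::mutex gives no such diagnosis, which is why thread 11 simply hangs. The function name relockSameThread is made up for this example.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>

// Locks an error-checking mutex twice from the same thread and returns the
// result of the second lock attempt. POSIX specifies EDEADLK for this case.
int relockSameThread() {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);

    pthread_mutex_t mutex;
    pthread_mutex_init(&mutex, &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&mutex);            // first lock succeeds
    int rc = pthread_mutex_lock(&mutex);   // second lock is rejected, no hang

    pthread_mutex_unlock(&mutex);          // release the single lock we hold
    pthread_mutex_destroy(&mutex);
    return rc;
}
```

Calling relockSameThread() should yield EDEADLK. Swapping an error-checking mutex in during testing is one way to turn a silent self-deadlock into an error code that can be logged.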
The destructor takes the same mutex:
TcpServerHandler::~TcpServerHandler()
{
    int connNum;
    {
        std::lock_guard<std::mutex> autoLock(m_ssHandlerMutex);
        auto it = m_ssHandler.find(_socket);
        if (it != m_ssHandler.end()) m_ssHandler.erase(it);
        connNum = m_ssHandler.size();
    }
}
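One possible way out, sketched under assumptions, is to make sure the guard has gone out of scope before delete this runs, so the destructor can take the mutex itself. The Handler class below is a hypothetical stand-in for TcpServerHandler: the registry, the sendFails flag (which replaces the exception thrown by sendBytes) and all names are invented for illustration, and the real fix depends on what m_ssHandlerMutex is meant to protect.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <set>

// Hypothetical stand-in for a handler that registers itself in a shared,
// mutex-protected registry and deletes itself on a send failure.
class Handler {
public:
    Handler() {
        std::lock_guard<std::mutex> guard(s_mutex);
        s_handlers.insert(this);
    }

    ~Handler() {
        // Takes the registry mutex, like the destructor shown above.
        std::lock_guard<std::mutex> guard(s_mutex);
        s_handlers.erase(this);
    }

    // Returns true if the handler destroyed itself.
    bool onWritable(bool sendFails) {
        bool fatal = false;
        {
            std::lock_guard<std::mutex> guard(s_mutex);
            fatal = sendFails;   // stand-in for sendBytes() throwing
        }                        // guard released here, before `delete this`
        if (fatal) {
            delete this;         // destructor can now lock s_mutex
            return true;
        }
        return false;
    }

    static std::size_t liveCount() {
        std::lock_guard<std::mutex> guard(s_mutex);
        return s_handlers.size();
    }

private:
    static std::mutex s_mutex;
    static std::set<Handler*> s_handlers;
};

std::mutex Handler::s_mutex;
std::set<Handler*> Handler::s_handlers;
```

With this shape, a heap-allocated Handler whose onWritable(true) is called removes itself from the registry and returns instead of hanging. Switching to std::recursive_mutex would also stop the hang, but narrowing the lock scope keeps the locking rules easier to reason about.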