How do you troubleshoot a multithreaded deadlock?

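When you develop with multiple threads, you will very likely run into a deadlock sooner or later, so knowing how to track one down is important.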

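  1. Step one: capture the program's current stacks with pstack. Capture several times and compare; the threads whose stacks never change show where the deadlock is. Then read the code. A deadlock needs all four necessary conditions to hold (mutual exclusion, hold and wait, no preemption, circular wait), so if the code is simple enough, work out which condition can be broken and break it. If the code shows nothing wrong and the problem cannot be located, move on to debugging.
  2. Step two: once the deadlock has occurred, attach gdb to the process (gdb -p <pid>, or attach <pid> inside gdb), list the threads with info thread, switch between them with thread N, find the mutex each thread is waiting on and which thread holds it, and work through them one by one. This step resolves most problems.
  3. Step three: if the first two steps do not pin the problem down, run the program under gdb, set breakpoints at the lock and unlock calls, and attach commands to those breakpoints. When the deadlock occurs, the resulting log shows what each thread was doing at the time.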
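As an illustration of breaking one of the four conditions, the sketch below removes circular wait by always taking two mutexes in the same global order (here, by address). The helpers lock_pair and unlock_pair, the two mutexes, and the workers are hypothetical names made up for this sketch; they are not part of the example program in this article.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical helper: acquire the mutex with the lower address first,
 * so every thread takes the two locks in the same order. */
static void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
    assert(a != b); /* locking the same default mutex twice would self-deadlock */
    if ((uintptr_t)a > (uintptr_t)b) {
        pthread_mutex_t *tmp = a;
        a = b;
        b = tmp;
    }
    pthread_mutex_lock(a);
    pthread_mutex_lock(b);
}

/* Release order does not matter for correctness. */
static void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
    pthread_mutex_unlock(a);
    pthread_mutex_unlock(b);
}

static pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t m2 = PTHREAD_MUTEX_INITIALIZER;
static int counter = 0;

/* This worker names the locks as (m1, m2). */
static void *worker_fwd(void *arg) {
    (void)arg;
    for (int i = 0; i < 10000; ++i) {
        lock_pair(&m1, &m2);
        ++counter;
        unlock_pair(&m1, &m2);
    }
    return NULL;
}

/* This worker names them as (m2, m1). With plain pthread_mutex_lock calls in
 * that order it could deadlock against worker_fwd; lock_pair reorders them. */
static void *worker_rev(void *arg) {
    (void)arg;
    for (int i = 0; i < 10000; ++i) {
        lock_pair(&m2, &m1);
        ++counter;
        unlock_pair(&m2, &m1);
    }
    return NULL;
}

/* Runs both workers to completion and returns the final counter value. */
static int run_ordered_demo(void) {
    pthread_t t1, t2;
    counter = 0;
    int rc = pthread_create(&t1, NULL, worker_fwd, NULL);
    assert(rc == 0);
    rc = pthread_create(&t2, NULL, worker_rev, NULL);
    assert(rc == 0);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counter;
}
```

Calling run_ordered_demo() from main and building with -pthread should finish without hanging, because no thread can hold one lock while waiting for the other in the opposite order.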

Here is a concrete example (it is the thr.cpp that the output further down refers to):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define LEN 10000
int num = 0;
pthread_mutex_t g_mutex;

void* thread_func(void* arg) {
    for (int i = 0; i < LEN; ++i) {
        pthread_mutex_lock(&g_mutex);
        num += 1;
        if (num == 9999) return NULL;  // returns without releasing the lock; this is only a simple example
        pthread_mutex_unlock(&g_mutex);
    }

    return NULL;
}

int main() {
    pthread_mutex_init(&g_mutex, NULL);

    pthread_t tid1, tid2, tid3;
    pthread_create(&tid1, NULL, thread_func, NULL);
    pthread_create(&tid2, NULL, thread_func, NULL);
    pthread_create(&tid3, NULL, thread_func, NULL);

    pthread_join(tid1, NULL);
    pthread_join(tid2, NULL);
    pthread_join(tid3, NULL);

    pthread_mutex_destroy(&g_mutex);

    printf("Check RST=%d, RST=%d.\n", 3 * LEN, num);
    return 0;
}
Method 1: pstack
[root@localhost ~]# ps -ef|grep ./thr
root     27736 26354  0 15:15 pts/1    00:00:00 ./thr
root     27753 27613  0 15:16 pts/2    00:00:00 grep --color=auto ./thr
[root@localhost ~]# pstack 27736
Thread 3 (Thread 0x7f26c4ebc700 (LWP 27737)):
#0  0x00007f26c528bf4d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f26c5287d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f26c5287c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000000000040080f in thread_func (arg=0x0) at thr.cpp:13
#4  0x00007f26c5285dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f26c4fb321d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f26c46bb700 (LWP 27738)):
#0  0x00007f26c528bf4d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f26c5287d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f26c5287c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000000000040080f in thread_func (arg=0x0) at thr.cpp:13
#4  0x00007f26c5285dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f26c4fb321d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f26c568a740 (LWP 27736)):
#0  0x00007f26c5286ef7 in pthread_join () from /lib64/libpthread.so.0
#1  0x00000000004008c9 in main () at thr.cpp:30
[root@localhost ~]# pstack 27736
Thread 3 (Thread 0x7f26c4ebc700 (LWP 27737)):
#0  0x00007f26c528bf4d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f26c5287d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f26c5287c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000000000040080f in thread_func (arg=0x0) at thr.cpp:13
#4  0x00007f26c5285dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f26c4fb321d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f26c46bb700 (LWP 27738)):
#0  0x00007f26c528bf4d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f26c5287d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f26c5287c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000000000040080f in thread_func (arg=0x0) at thr.cpp:13
#4  0x00007f26c5285dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f26c4fb321d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f26c568a740 (LWP 27736)):
#0  0x00007f26c5286ef7 in pthread_join () from /lib64/libpthread.so.0
#1  0x00000000004008c9 in main () at thr.cpp:30

The two captures are essentially identical: both worker threads are stuck at thr.cpp:13.
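When pstack is not available on the target machine, a rough in-process analogue is to wait for the lock with a timeout and report the call site when the wait expires. The sketch below assumes a POSIX system that provides pthread_mutex_timedlock; lock_with_timeout, LOCK_OR_REPORT, and the demo functions are hypothetical names introduced here, and the 100 ms timeout is an arbitrary choice.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical wrapper: wait at most timeout_ms for the mutex.
 * Returns 0 on success or an error code such as ETIMEDOUT. */
static int lock_with_timeout(pthread_mutex_t *m, long timeout_ms,
                             const char *file, int line) {
    struct timespec deadline;
    /* pthread_mutex_timedlock expects an absolute CLOCK_REALTIME deadline. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }
    int rc = pthread_mutex_timedlock(m, &deadline);
    if (rc == ETIMEDOUT)
        fprintf(stderr, "possible deadlock: lock wait timed out at %s:%d\n",
                file, line);
    return rc;
}

#define LOCK_OR_REPORT(m, ms) lock_with_timeout((m), (ms), __FILE__, __LINE__)

static pthread_mutex_t demo_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Tries to take a lock that the main thread never releases in time. */
static void *waiter(void *out) {
    int rc = LOCK_OR_REPORT(&demo_mutex, 100);
    if (rc == 0) pthread_mutex_unlock(&demo_mutex);
    *(int *)out = rc;
    return NULL;
}

/* Holds the lock while the waiter runs; returns the waiter's result. */
static int demo_stuck_lock(void) {
    int result = -1;
    pthread_t tid;
    pthread_mutex_lock(&demo_mutex);
    int rc = pthread_create(&tid, NULL, waiter, &result);
    assert(rc == 0);
    pthread_join(tid, NULL);
    pthread_mutex_unlock(&demo_mutex);
    return result;
}
```

A timeout only hints at a deadlock, since a lock that is merely held for a long time triggers the same message, so the reported file and line are a starting point for the stack comparison rather than a verdict.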

Method 2: gdb attach

[root@localhost ~]# gdb attch 27736
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
attch: No such file or directory.
Attaching to process 27736
Reading symbols from /root/jiangsu-wuxi/poco_demo/thr...done.
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 27738]
[New LWP 27737]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x00007f26c5286ef7 in pthread_join () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install glibc-2.17-105.el7.x86_64
(gdb) info thread
  Id   Target Id         Frame 
  3    Thread 0x7f26c4ebc700 (LWP 27737) "thr" 0x00007f26c528bf4d in __lll_lock_wait () from /lib64/libpthread.so.0
  2    Thread 0x7f26c46bb700 (LWP 27738) "thr" 0x00007f26c528bf4d in __lll_lock_wait () from /lib64/libpthread.so.0
* 1    Thread 0x7f26c568a740 (LWP 27736) "thr" 0x00007f26c5286ef7 in pthread_join () from /lib64/libpthread.so.0
(gdb) thread 3
[Switching to thread 3 (Thread 0x7f26c4ebc700 (LWP 27737))]
#0  0x00007f26c528bf4d in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f26c528bf4d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f26c5287d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f26c5287c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000000000040080f in thread_func (arg=0x0) at thr.cpp:13
#4  0x00007f26c5285dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f26c4fb321d in clone () from /lib64/libc.so.6
(gdb) f 3
#3  0x000000000040080f in thread_func (arg=0x0) at thr.cpp:13
13	        pthread_mutex_lock(&g_mutex);
(gdb) p g_mutex
$1 = {__data = {__lock = 2, __count = 0, __owner = 27739, __nusers = 1, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000[l\000\000\001", '\000' <repeats 26 times>, __align = 2}
(gdb) 

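Here __owner = 27739 means the mutex is held by LWP 27739, yet info thread lists no such thread: the owner has already exited without releasing the lock.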
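The same situation, a mutex whose owner no longer exists, can also be detected at run time with a robust mutex. The sketch below assumes Linux with glibc, where pthread_mutexattr_setrobust and pthread_mutex_consistent are available; robust_mutex, leaky_owner, and lock_after_owner_exit are hypothetical names for this illustration and are not part of the article's program.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t robust_mutex;

/* Mimics the bug in the example: lock, then return without unlocking. */
static void *leaky_owner(void *arg) {
    (void)arg;
    pthread_mutex_lock(&robust_mutex);
    return NULL;
}

/* Returns the result of the first lock attempt made after the owner exited. */
static int lock_after_owner_exit(void) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    pthread_mutex_init(&robust_mutex, &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_t tid;
    int rc = pthread_create(&tid, NULL, leaky_owner, NULL);
    assert(rc == 0);
    pthread_join(tid, NULL); /* the owner is gone, the mutex is still locked */

    /* A default mutex would block here forever. A robust mutex is handed to
     * the caller and the lock call returns EOWNERDEAD instead. */
    rc = pthread_mutex_lock(&robust_mutex);
    if (rc == EOWNERDEAD) {
        /* The caller now holds the lock and must mark it usable again. */
        pthread_mutex_consistent(&robust_mutex);
    }
    if (rc == 0 || rc == EOWNERDEAD) pthread_mutex_unlock(&robust_mutex);
    pthread_mutex_destroy(&robust_mutex);
    return rc;
}
```

This is a diagnostic aid, not a fix: data protected by the mutex may be half-updated when EOWNERDEAD is returned, so the leaking code path still has to be found and corrected.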

Method 3: gdb breakpoints with commands

[root@localhost poco_demo]# gdb ./thr
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /root/jiangsu-wuxi/poco_demo/thr...done.
(gdb) b thr.cpp :13
Breakpoint 1 at 0x400805: file thr.cpp, line 13.
(gdb) b thr.cpp :16
Breakpoint 2 at 0x400832: file thr.cpp, line 16.
(gdb) i b
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x0000000000400805 in thread_func(void*) at thr.cpp:13
2       breakpoint     keep y   0x0000000000400832 in thread_func(void*) at thr.cpp:16
(gdb) commands 1
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>p "lock"
>thread
>c
>end
(gdb) commands 2
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>p "unlock"
>thread
>c
>end
(gdb) set pagination off
......
Breakpoint 1, thread_func (arg=0x0) at thr.cpp:13
13	        pthread_mutex_lock(&g_mutex);
$19997 = "lock"
[Current thread is 2 (Thread 0x7ffff77fd700 (LWP 29692))]
[Switching to Thread 0x7ffff67fb700 (LWP 29694)]

Breakpoint 2, thread_func (arg=0x0) at thr.cpp:16
16	        pthread_mutex_unlock(&g_mutex);
$19998 = "unlock"
[Current thread is 4 (Thread 0x7ffff67fb700 (LWP 29694))]

Breakpoint 1, thread_func (arg=0x0) at thr.cpp:13
13	        pthread_mutex_lock(&g_mutex);
$19999 = "lock"
[Current thread is 4 (Thread 0x7ffff67fb700 (LWP 29694))]
[Thread 0x7ffff77fd700 (LWP 29692) exited]

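The output finally stalls at thr.cpp:13: the last event logged is a "lock" that is never followed by an "unlock". Analyzing the stack at that point leads to the line if (num == 9999) return NULL;. It returns without releasing the lock, so no other thread can ever acquire the lock again, and neither could the returning thread itself.
P.S. Without set pagination off, gdb pauses once the output reaches a certain length and waits for confirmation before it continues.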
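One possible fix, sketched under the assumption that the early exit at 9999 should be kept, is to release the lock on every path out of the loop. The names thread_func_fixed and run_fixed are made up for this sketch; the rest mirrors the example program.

```c
#include <assert.h>
#include <pthread.h>

#define LEN 10000
static int num = 0;
static pthread_mutex_t g_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Same loop as thread_func, but the mutex is released before any return. */
static void *thread_func_fixed(void *arg) {
    (void)arg;
    for (int i = 0; i < LEN; ++i) {
        pthread_mutex_lock(&g_mutex);
        num += 1;
        int done = (num == 9999); /* decide while still holding the lock */
        pthread_mutex_unlock(&g_mutex);
        if (done) return NULL; /* early exit no longer leaks the lock */
    }
    return NULL;
}

/* Starts three workers, waits for all of them, and returns the final count. */
static int run_fixed(void) {
    pthread_t tid[3];
    num = 0;
    for (int i = 0; i < 3; ++i) {
        int rc = pthread_create(&tid[i], NULL, thread_func_fixed, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < 3; ++i) pthread_join(tid[i], NULL);
    return num;
}
```

With this change all three joins return. The final count still falls short of 3 * LEN, because the thread that sees 9999 stops early; how far short depends on scheduling.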
