python debugging symbols_用gdb调试python多线程代码-记一次死锁的发现

|版权:本文版权归作者和博客园共有,欢迎转载,但未经作者同意必须保留此段声明,且在文章页面明显位置给出原文连接。如有问题,可以邮件:wangxu198709@gmail.com

前言

相信很多人都有使用过sqlite3的经验,一年前因为项目上的需要,写了一个基于sqlite3的持久化队列库(persist-queue),已经发布在pypi上有段时间了。 前段时间,一下子来了两个issues,一个是关于in-memory database的support,一个是sqlite3 队列性能的问题。在数据量不大的情况下,sqlite的队列会在某些情况下出现:

sqlite3.OperationalError: database is locked

在修复上面的问题的时候,注意到,在多线程的情况下更容易触发上面的exception,并且比较怪的是,有时测试还有两个问题: 1. 出现死锁,并且CPU占用会一直保持在100%。

在程序里面下断点(import pdb;pdb.set_trace())和log都不好使,关键是无法精确定位到所有线程在当时的运行状态。

找了不少方法,最后还是发现了管用的方法,使用大名鼎鼎的: gdb

相比pdb,gdb有以下几个优点:

不需要显示的下断点,如"import pdb;pdb.set_trace()"

可以方便的调试多线程程序,允许你调试过程中切换调试线程。很多python debug是不支持的如 winpdb, pydevd

如果python解释器core dump了,生成的core dump文件可以直接用gdb 来分析,而gdb只能望“dump”兴叹了。

Python给gdb准备了以extension,方便用户不仅可以查看Python解释器的运行情况,还可以查看用户程序的运行情况。

python gdb extension在gdb的环境下提供了如下几个py-*命令

py-list查看当前python应用程序上下文

py-bt 查看当前python应用程序调用堆栈

py-bt-full 查看当前python应用程序调用堆栈,并且显示每个frame的详细情况

py-print 查看python变量

py-locals 查看当前的scope的变量

py-up 查看上一个frame

py-down 查看下一个frame

环境准备

首先按照需要的各种调试用到的包

1. 安装gdb, 调试的主要工具

sudo apt-get install gdb

2. 安装python-dbg, 用来在调试的时候看到python源代码的call stack

sudo apt-get install python-dbg

这里面我都是用的python2.7,如果用的是python3.5,需要对应的安装python3.5-dbg

重现问题(死锁,Hung process)

$ tox -e py27

py27 develop-inst-nodeps: /home/peter/Documents/persist-queue

py27 installed: configparser==3.5.0,cov-core==1.15.0,coverage==4.4.1,enum-compat==0.0.2,enum34==1.1.6,eventlet==0.21.0,flake8==3.5.0,funcsigs==1.0.2,greenlet==0.4.12,mccabe==0.6.1,mock==2.0.0,nose2==0.6.5,pbr==3.1.1,-e git+https://github.com/peter-wangxu/persist-queue@d26561e3c7c9a35fd75ddedf0a023a7dcbd563a9#egg=persist_queue,pkg-resources==0.0.0,pycodestyle==2.3.1,pyflakes==1.6.0,six==1.11.0,virtualenv==15.1.0

py27 runtests: PYTHONHASHSEED='564480154'

py27 runtests: commands[0] | nose2 --with-coverage --coverage-report xml --coverage-report term

WARNING:persistqueue.sqlbase:auto_commit=False is still experimental,only use it with care.

....

上面的测试永远不会返回,悲伤啊。。

使用gdb加载symbols

1. 首先找到进程号

$ ps -ef | greptox

peter9376 1777 0 15:10 pts/17 00:00:00 /usr/bin/python3 /usr/bin/tox

peter9409 9376 67 15:10 pts/17 01:04:14 /home/peter/Documents/persist-queue/.tox/py27/bin/python2.7 .tox/py27/bin/nose2 --with-coverage --coverage-report xml --coverage-report term

peter9913 9639 0 16:45 pts/18 00:00:00 grep --color=auto tox

上面的 9409 就是我卡住的进程了。

接下来就是使用gdb调试: sudo gdb python -p 9409

记得用用sudo,否则symbols有可能无法加载成功

sudo gdb python -p 9409GNU gdb (Ubuntu7.11.1-0ubuntu1~16.5) 7.11.1Copyright (C)2016Free Software Foundation, Inc.

License GPLv3+: GNU GPL version 3 or later

This is free software: you are freeto change and redistribute it.

There is NO WARRANTY, to the extent permitted by law. Type"show copying"and"show warranty" fordetails.

This GDB was configured as"x86_64-linux-gnu".

Type"show configuration" forconfiguration details.

For bug reporting instructions, please see:.

Find the GDB manual and other documentation resources online at:.

For help, type "help".

Type"apropos word" to search for commands related to "word"...

Reading symbols from python...Reading symbols from/usr/lib/debug/.build-id/cd/e2c487269892a2815c715667ae336984b82b0c.debug...done.done.

Attaching to program:/usr/bin/python, process 9409[New LWP9427]

[Thread debugging using libthread_db enabled]

Using host libthread_db library"/lib/x86_64-linux-gnu/libthread_db.so.1".0x00007fb923247827 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x152af60) at ../sysdeps/unix/sysv/linux/futex-internal.h:205

205 ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.

注意到这句 “Reading symbols from python...Reading symbols from /usr/lib/debug/.build-id/cd/e2c487269892a2815c715667ae336984b82b0c.debug...done.”

表示加载成功,否则会看到“no debugging symbols found”的log,可能是使用的python2.7跟dbg包的版本不一致。

2. 安装用到的包的dbg信息

注意到我上面有一句 “Reading symbols from /usr/lib/x86_64-linux-gnu/libsqlite3.so.0...(no debugging symbols found)...done.”,表示我的sqlite3的symbols没有加载,直接安装就好

sudo apt-get install libsqlite3-0-dbg

重现进入gdb

看到这句

Reading symbols from /usr/lib/x86_64-linux-gnu/libsqlite3.so.0...Reading symbols from /usr/lib/debug/.build-id/78/5ce90f42d13a60a36853327823ef5f90f2e8f1.debug...done.

所有我关心的symbols就加载完了。

使用gdb debug 程序

debug 主线程

首先,可以是用 info threads,查看当前的线程

(gdb) infothreads

Id Target Id Frame* 1 Thread 0x7fb92365b700 (LWP 9409) "nose2" 0x00007fb923247827 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x152af60)

at ../sysdeps/unix/sysv/linux/futex-internal.h:205

2 Thread 0x7fb91d4f1700 (LWP 9427) "nose2" _PyObject_GenericGetAttrWithDict () at ../Objects/object.c:1392

带 * 的这个就是当前正在debug的thread。

查看主线程为何一直卡住

(gdb) thread 1[Switching to thread1 (Thread 0x7fb92365b700 (LWP 9409))]#0 0x00007fb923247827 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x152af60) at ../sysdeps/unix/sysv/linux/futex-internal.h:205

205 ../sysdeps/unix/sysv/linux/futex-internal.h: No such file ordirectory.

(gdb) py-list335waiter.acquire()336 self.__waiters.append(waiter)337 saved_state =self._release_save()338 try: #restore state no matter what (e.g., KeyboardInterrupt)

339 if timeout isNone:>340waiter.acquire()341 if __debug__:342 self._note("%s.wait(): got it", self)343 else:344 #Balancing act: We can't afford a pure busy loop, so we

345 #have to sleep; but if we sleep the whole timeout time,

(gdb) py-up#7 Frame 0x7fb9180068e0, for file /usr/lib/python2.7/threading.py, line 340, in wait (self=<_Condition(_Condition__lock=, acquire=, _Verbose__verbose=False, _Condition__waiters=[], release=) at remote 0x7fb91fdb1c10>, timeout=None, waiter=, saved_state=None)

waiter.acquire()

(gdb) py-up#11 Frame 0x7fb91fdb69b0, for file /usr/lib/python2.7/threading.py, line 940, in join (self=, acquire=, _Verbose__verbose=False, _Condition__waiters=[], release=) at remote 0x7fb91fdb1c10>, _Thread__args=(3,), _Thread__ident=140433037399808, _Thread__target=, _Thread__daemonic=False, _Thread__initialized=True, _Thread__started=<_Event(_Event__flag=True, _Verbose__verbose=False, _Event__cond=<_Condition(_Condition__lock=, acquire=, _Verbose__verbose=False, _Condition__waiters=[], release=) at remote 0x7fb91fdb1bd0>) at remote 0x7fb91f...(truncated)

self.__block.wait()

(gdb) py-up#15 Frame 0x13d7ba0, for file /home/peter/Documents/persist-queue/tests/test_sqlqueue.py, line 179, in test_multiple_consumers (self=: 'assertSetEqual', : 'assertMultiLineEqual', : 'assertDictEqual', : 'assertTupleEqual', : 'assertSetEqual', : 'assertListEqual'}, _resultForDoCleanups=, logLevel=30, hooks=

t.join()

(gdb) py-list174t.start()175consumers.append(t)176

177p.join()178 for t inconsumers:>179t.join()180

181self.assertEqual(0, queue.qsize())182 for x in range(1000):183self.assertNotEqual(0, counter[x],184 "not 0 for counter's index %s" % x)

可以看到我使用了py-list和几次的py-up来到了我写的测试test_sqlqueue.py里面。

发现主线程是在t.join在等待子线程完成,但是奇怪的是,子线程一直没有完成任务。

debug工作线程

接下来,使用 thread 2 切换到上面的 2 线程。

(gdb) thread 2

[Switching to thread 2 (Thread 0x7f02f51f9700 (LWP 18749))]

#0 PyType_IsSubtype () at ../Objects/typeobject.c:1184

1184 ../Objects/typeobject.c: No such file or directory.

使用py-list查看当前运行位置

(gdb) py-list

64 raise Empty

65 elif timeout is None:

66 # block until a put event.

67 pickled = self._pop()

68 while not pickled:

>69 self.put_event.wait()

70 pickled = self._pop()

71 elif timeout < 0:

72 raise ValueError("'timeout' must be a non-negative number")

73 else:

74 # block until the timeout reached

(gdb)

可以看到,工作线程是在 "self.put_event.wait()“,一种可能是self.put_event一直是false,没有被set位True,所以线程就这么一直等着。继续分析

使用py-locals 查看put-event当前值

(gdb) py-up#26 Frame 0x7f02f82b3608, for file /home/peter/Documents/persist-queue/persistqueue/sqlqueue.py, line 49, in _pop (self=, path=':memory:', action_lock=, memory_sql=True, put_event=<_Event(_Event__cond=<_Condition(acquire=, release=, _Condition__lock=, _Condition__waiters=[], _Verbose__verbose=False) at remote 0x7f02f835cf90>, _Event__flag=True, _Verbose__verbose=False) at remote 0x7f02f835ce90>, multithreading=True, tran_lock=, timeout=, auto_commit=True, _getter=, _conn=) at remote 0x7f02f8491a10>)

row =self._select()

(gdb) py-list44 #Action lock to assure multiple action to be *atomic*

45 self.action_lock =threading.Lock()46

47 def_pop(self):48with self.action_lock:>49 row =self._select()50 #Perhaps a sqlite3 bug, sometimes (None, None) is returned

51 #by select, below can avoid these invalid records.

52 if row and row[0] is notNone:53self._delete(row[0])54 if notself.auto_commit:

(gdb) py-printself

local'self' = , path=':memory:', action_lock=, memory_sql=True, put_event=<_Event(_Event__cond=<_Condition(acquire=, release=, _Condition__lock=, _Condition__waiters=[], _Verbose__verbose=False) at remote 0x7f02f835cf90>, _Event__flag=True, _Verbose__verbose=False) at remote 0x7f02f835ce90>, multithreading=True, tran_lock=, timeout=, auto_commit=True, _getter=, _conn=) at remote 0x7f02f8491a10>(gdb) py-locals

self= , path=':memory:', action_lock=, memory_sql=True, put_event=<_Event(_Event__cond=<_Condition(acquire=, release=, _Condition__lock=, _Condition__waiters=[], _Verbose__verbose=False) at remote 0x7f02f835cf90>, _Event__flag=True, _Verbose__verbose=False) at remote 0x7f02f835ce90>, multithreading=True, tran_lock=, timeout=, auto_commit=True, _getter=, _conn=) at remote 0x7f02f8491a10>

可以看到这个put_event是一直为True(_Event__flag=True),仔细想想,我好像没设置为False过。。。

结合代码:

1 def get(self, block=True, timeout=None):2 if notblock:3 pickled =self._pop()4 if notpickled:5 raiseEmpty6 elif timeout isNone:7 #block until a put event.

8 pickled =self._pop()9 while notpickled:10 self.put_event.wait(TICK_FOR_WAIT)11 pickled =self._pop()12 elif timeout <0:13 raise ValueError("'timeout' must be a non-negative number")14 else:15 #block until the timeout reached

16 endtime = _time.time() +timeout17 pickled =self._pop()18 while notpickled:19 remaining = endtime -_time.time()20 if remaining <= 0.0:21 raiseEmpty22 self.put_event.wait(23 TICK_FOR_WAIT if TICK_FOR_WAIT < remaining elseremaining)24 pickled =self._pop()25 item =pickle.loads(pickled)26 return item

如果因为某种情况,queue为空,那么上面的9-11行就是一个无限循环,并且cpu会被占用到100%,

好了,到这里就大概搞清楚为什么有时候多线程测试是,程序一直为hung住,并且CPU的占用一直是100%,还会报出 ”sqlite3.OperationalError: database is locked“ 的error(因为程序的线程一直在尝试读数据库,如果其他线程有update,或者inster拿到了sqlite的文件锁,读线程会读超时)

参考文献

http://podoliaka.org/2016/04/10/debugging-cpython-gdb

https://wiki.python.org/moin/DebuggingWithGdb

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值