|版权:本文版权归作者和博客园共有,欢迎转载,但未经作者同意必须保留此段声明,且在文章页面明显位置给出原文连接。如有问题,可以邮件:wangxu198709@gmail.com
前言
相信很多人都有使用过sqlite3的经验,一年前因为项目上的需要,写了一个基于sqlite3的持久化队列库(persist-queue),已经发布在pypi上有段时间了。 前段时间,一下子来了两个issues,一个是关于in-memory database的support,一个是sqlite3 队列性能的问题。在数据量不大的情况下,sqlite的队列会在某些情况下出现:
sqlite3.OperationalError: database is locked
在修复上面的问题的时候,注意到,在多线程的情况下更容易触发上面的exception,并且比较怪的是,有时测试还有两个问题: 1. 出现死锁,并且CPU占用会一直保持在100%。
在程序里面下断点(import pdb;pdb.set_trace())和log都不好使,关键是无法精确定位到所有线程在当时的运行状态。
找了不少方法,最后还是发现了管用的方法,使用大名鼎鼎的: gdb
相比pdb,gdb有以下几个优点:
不需要显示的下断点,如"import pdb;pdb.set_trace()"
可以方便的调试多线程程序,允许你调试过程中切换调试线程。很多python debug是不支持的如 winpdb, pydevd
如果python解释器core dump了,生成的core dump文件可以直接用gdb 来分析,而gdb只能望“dump”兴叹了。
Python给gdb准备了以extension,方便用户不仅可以查看Python解释器的运行情况,还可以查看用户程序的运行情况。
python gdb extension在gdb的环境下提供了如下几个py-*命令
py-list查看当前python应用程序上下文
py-bt 查看当前python应用程序调用堆栈
py-bt-full 查看当前python应用程序调用堆栈,并且显示每个frame的详细情况
py-print 查看python变量
py-locals 查看当前的scope的变量
py-up 查看上一个frame
py-down 查看下一个frame
环境准备
首先按照需要的各种调试用到的包
1. 安装gdb, 调试的主要工具
sudo apt-get install gdb
2. 安装python-dbg, 用来在调试的时候看到python源代码的call stack
sudo apt-get install python-dbg
这里面我都是用的python2.7,如果用的是python3.5,需要对应的安装python3.5-dbg
重现问题(死锁,Hung process)
$ tox -e py27
py27 develop-inst-nodeps: /home/peter/Documents/persist-queue
py27 installed: configparser==3.5.0,cov-core==1.15.0,coverage==4.4.1,enum-compat==0.0.2,enum34==1.1.6,eventlet==0.21.0,flake8==3.5.0,funcsigs==1.0.2,greenlet==0.4.12,mccabe==0.6.1,mock==2.0.0,nose2==0.6.5,pbr==3.1.1,-e git+https://github.com/peter-wangxu/persist-queue@d26561e3c7c9a35fd75ddedf0a023a7dcbd563a9#egg=persist_queue,pkg-resources==0.0.0,pycodestyle==2.3.1,pyflakes==1.6.0,six==1.11.0,virtualenv==15.1.0
py27 runtests: PYTHONHASHSEED='564480154'
py27 runtests: commands[0] | nose2 --with-coverage --coverage-report xml --coverage-report term
WARNING:persistqueue.sqlbase:auto_commit=False is still experimental,only use it with care.
....
上面的测试永远不会返回,悲伤啊。。
使用gdb加载symbols
1. 首先找到进程号
$ ps -ef | greptox
peter9376 1777 0 15:10 pts/17 00:00:00 /usr/bin/python3 /usr/bin/tox
peter9409 9376 67 15:10 pts/17 01:04:14 /home/peter/Documents/persist-queue/.tox/py27/bin/python2.7 .tox/py27/bin/nose2 --with-coverage --coverage-report xml --coverage-report term
peter9913 9639 0 16:45 pts/18 00:00:00 grep --color=auto tox
上面的 9409 就是我卡住的进程了。
接下来就是使用gdb调试: sudo gdb python -p 9409
记得用用sudo,否则symbols有可能无法加载成功
sudo gdb python -p 9409GNU gdb (Ubuntu7.11.1-0ubuntu1~16.5) 7.11.1Copyright (C)2016Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are freeto change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type"show copying"and"show warranty" fordetails.
This GDB was configured as"x86_64-linux-gnu".
Type"show configuration" forconfiguration details.
For bug reporting instructions, please see:.
Find the GDB manual and other documentation resources online at:.
For help, type "help".
Type"apropos word" to search for commands related to "word"...
Reading symbols from python...Reading symbols from/usr/lib/debug/.build-id/cd/e2c487269892a2815c715667ae336984b82b0c.debug...done.done.
Attaching to program:/usr/bin/python, process 9409[New LWP9427]
[Thread debugging using libthread_db enabled]
Using host libthread_db library"/lib/x86_64-linux-gnu/libthread_db.so.1".0x00007fb923247827 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x152af60) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
205 ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
注意到这句 “Reading symbols from python...Reading symbols from /usr/lib/debug/.build-id/cd/e2c487269892a2815c715667ae336984b82b0c.debug...done.”
表示加载成功,否则会看到“no debugging symbols found”的log,可能是使用的python2.7跟dbg包的版本不一致。
2. 安装用到的包的dbg信息
注意到我上面有一句 “Reading symbols from /usr/lib/x86_64-linux-gnu/libsqlite3.so.0...(no debugging symbols found)...done.”,表示我的sqlite3的symbols没有加载,直接安装就好
sudo apt-get install libsqlite3-0-dbg
重现进入gdb
看到这句
Reading symbols from /usr/lib/x86_64-linux-gnu/libsqlite3.so.0...Reading symbols from /usr/lib/debug/.build-id/78/5ce90f42d13a60a36853327823ef5f90f2e8f1.debug...done.
所有我关心的symbols就加载完了。
使用gdb debug 程序
debug 主线程
首先,可以是用 info threads,查看当前的线程
(gdb) infothreads
Id Target Id Frame* 1 Thread 0x7fb92365b700 (LWP 9409) "nose2" 0x00007fb923247827 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x152af60)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
2 Thread 0x7fb91d4f1700 (LWP 9427) "nose2" _PyObject_GenericGetAttrWithDict () at ../Objects/object.c:1392
带 * 的这个就是当前正在debug的thread。
查看主线程为何一直卡住
(gdb) thread 1[Switching to thread1 (Thread 0x7fb92365b700 (LWP 9409))]#0 0x00007fb923247827 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x152af60) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
205 ../sysdeps/unix/sysv/linux/futex-internal.h: No such file ordirectory.
(gdb) py-list335waiter.acquire()336 self.__waiters.append(waiter)337 saved_state =self._release_save()338 try: #restore state no matter what (e.g., KeyboardInterrupt)
339 if timeout isNone:>340waiter.acquire()341 if __debug__:342 self._note("%s.wait(): got it", self)343 else:344 #Balancing act: We can't afford a pure busy loop, so we
345 #have to sleep; but if we sleep the whole timeout time,
(gdb) py-up#7 Frame 0x7fb9180068e0, for file /usr/lib/python2.7/threading.py, line 340, in wait (self=<_Condition(_Condition__lock=, acquire=, _Verbose__verbose=False, _Condition__waiters=[], release=) at remote 0x7fb91fdb1c10>, timeout=None, waiter=, saved_state=None)
waiter.acquire()
(gdb) py-up#11 Frame 0x7fb91fdb69b0, for file /usr/lib/python2.7/threading.py, line 940, in join (self=, acquire=, _Verbose__verbose=False, _Condition__waiters=[], release=) at remote 0x7fb91fdb1c10>, _Thread__args=(3,), _Thread__ident=140433037399808, _Thread__target=, _Thread__daemonic=False, _Thread__initialized=True, _Thread__started=<_Event(_Event__flag=True, _Verbose__verbose=False, _Event__cond=<_Condition(_Condition__lock=, acquire=, _Verbose__verbose=False, _Condition__waiters=[], release=) at remote 0x7fb91fdb1bd0>) at remote 0x7fb91f...(truncated)
self.__block.wait()
(gdb) py-up#15 Frame 0x13d7ba0, for file /home/peter/Documents/persist-queue/tests/test_sqlqueue.py, line 179, in test_multiple_consumers (self=: 'assertSetEqual', : 'assertMultiLineEqual', : 'assertDictEqual', : 'assertTupleEqual', : 'assertSetEqual', : 'assertListEqual'}, _resultForDoCleanups=, logLevel=30, hooks=
t.join()
(gdb) py-list174t.start()175consumers.append(t)176
177p.join()178 for t inconsumers:>179t.join()180
181self.assertEqual(0, queue.qsize())182 for x in range(1000):183self.assertNotEqual(0, counter[x],184 "not 0 for counter's index %s" % x)
可以看到我使用了py-list和几次的py-up来到了我写的测试test_sqlqueue.py里面。
发现主线程是在t.join在等待子线程完成,但是奇怪的是,子线程一直没有完成任务。
debug工作线程
接下来,使用 thread 2 切换到上面的 2 线程。
(gdb) thread 2
[Switching to thread 2 (Thread 0x7f02f51f9700 (LWP 18749))]
#0 PyType_IsSubtype () at ../Objects/typeobject.c:1184
1184 ../Objects/typeobject.c: No such file or directory.
使用py-list查看当前运行位置
(gdb) py-list
64 raise Empty
65 elif timeout is None:
66 # block until a put event.
67 pickled = self._pop()
68 while not pickled:
>69 self.put_event.wait()
70 pickled = self._pop()
71 elif timeout < 0:
72 raise ValueError("'timeout' must be a non-negative number")
73 else:
74 # block until the timeout reached
(gdb)
可以看到,工作线程是在 "self.put_event.wait()“,一种可能是self.put_event一直是false,没有被set位True,所以线程就这么一直等着。继续分析
使用py-locals 查看put-event当前值
(gdb) py-up#26 Frame 0x7f02f82b3608, for file /home/peter/Documents/persist-queue/persistqueue/sqlqueue.py, line 49, in _pop (self=, path=':memory:', action_lock=, memory_sql=True, put_event=<_Event(_Event__cond=<_Condition(acquire=, release=, _Condition__lock=, _Condition__waiters=[], _Verbose__verbose=False) at remote 0x7f02f835cf90>, _Event__flag=True, _Verbose__verbose=False) at remote 0x7f02f835ce90>, multithreading=True, tran_lock=, timeout=, auto_commit=True, _getter=, _conn=) at remote 0x7f02f8491a10>)
row =self._select()
(gdb) py-list44 #Action lock to assure multiple action to be *atomic*
45 self.action_lock =threading.Lock()46
47 def_pop(self):48with self.action_lock:>49 row =self._select()50 #Perhaps a sqlite3 bug, sometimes (None, None) is returned
51 #by select, below can avoid these invalid records.
52 if row and row[0] is notNone:53self._delete(row[0])54 if notself.auto_commit:
(gdb) py-printself
local'self' = , path=':memory:', action_lock=, memory_sql=True, put_event=<_Event(_Event__cond=<_Condition(acquire=, release=, _Condition__lock=, _Condition__waiters=[], _Verbose__verbose=False) at remote 0x7f02f835cf90>, _Event__flag=True, _Verbose__verbose=False) at remote 0x7f02f835ce90>, multithreading=True, tran_lock=, timeout=, auto_commit=True, _getter=, _conn=) at remote 0x7f02f8491a10>(gdb) py-locals
self= , path=':memory:', action_lock=, memory_sql=True, put_event=<_Event(_Event__cond=<_Condition(acquire=, release=, _Condition__lock=, _Condition__waiters=[], _Verbose__verbose=False) at remote 0x7f02f835cf90>, _Event__flag=True, _Verbose__verbose=False) at remote 0x7f02f835ce90>, multithreading=True, tran_lock=, timeout=, auto_commit=True, _getter=, _conn=) at remote 0x7f02f8491a10>
可以看到这个put_event是一直为True(_Event__flag=True),仔细想想,我好像没设置为False过。。。
结合代码:
1 def get(self, block=True, timeout=None):2 if notblock:3 pickled =self._pop()4 if notpickled:5 raiseEmpty6 elif timeout isNone:7 #block until a put event.
8 pickled =self._pop()9 while notpickled:10 self.put_event.wait(TICK_FOR_WAIT)11 pickled =self._pop()12 elif timeout <0:13 raise ValueError("'timeout' must be a non-negative number")14 else:15 #block until the timeout reached
16 endtime = _time.time() +timeout17 pickled =self._pop()18 while notpickled:19 remaining = endtime -_time.time()20 if remaining <= 0.0:21 raiseEmpty22 self.put_event.wait(23 TICK_FOR_WAIT if TICK_FOR_WAIT < remaining elseremaining)24 pickled =self._pop()25 item =pickle.loads(pickled)26 return item
如果因为某种情况,queue为空,那么上面的9-11行就是一个无限循环,并且cpu会被占用到100%,
好了,到这里就大概搞清楚为什么有时候多线程测试是,程序一直为hung住,并且CPU的占用一直是100%,还会报出 ”sqlite3.OperationalError: database is locked“ 的error(因为程序的线程一直在尝试读数据库,如果其他线程有update,或者inster拿到了sqlite的文件锁,读线程会读超时)
参考文献
http://podoliaka.org/2016/04/10/debugging-cpython-gdb
https://wiki.python.org/moin/DebuggingWithGdb