最近在公司运行的一个老项目所在的MySQL服务器经常收到有对系统性能有效大影响的阻塞事件告警,阻塞时间长达一分钟。
在MySQL服务器中查询到阻塞进程的SQL语句如下:
UPDATE user SET nickname='' WHERE usercode='117328';
查询到数据库的的默认编码是utf8,而SQL语句中的字段是emoji表情符号,应该是应用程序没有处理好表情符号的问题,导致更新数据报错:
_mysql_exceptions.OperationalError: (1366, "Incorrect string value: '\\xF0\\x9F\\x98\\x80\\xF0\\x9F...' for column 'nickname' at row 1")
(数据库的编码支持emoji表情的话应该设置成utf8mb4)
报错的问题找到了,将数据库的编码设置成utf8mb4支持4字节就可以了。但是阻塞的问题没有解决。
于是查询了出错的日志信息,发现每次更新nickname报错的时候,耗时都特别长,接近60秒。因此怀疑是做更新的时候UPDATE语句在一直等待要更新的行锁。问题是谁持有的行锁没有释放呢?
为了查找元凶,在测试环境打开了mysql的日志查询:
set global general_log=1;
调用测试用例测试user表的时候,发现mysql的日志中有三条相同的update日志,其中有两条是属于同一个线程ID,
180822 16:55:44 27114 Connect usercenter@172.16.10.14 on usercenter
27114 Query SET NAMES utf8
27114 Query set autocommit=0
27115 Connect usercenter@172.16.10.14 on usercenter
27115 Query SET NAMES utf8
27115 Query set autocommit=0
27116 Connect usercenter@172.16.10.14 on usercenter
27116 Query SET NAMES utf8
27116 Query set autocommit=0
27117 Connect usercenter@172.16.10.14 on usercenter
27117 Query SET NAMES utf8
27117 Query set autocommit=0
27118 Connect usercenter@172.16.10.14 on usercenter
27118 Query SET NAMES utf8
27118 Query set autocommit=0
27118 Query rollback
27117 Query rollback
27116 Query rollback
27115 Query rollback
27114 Query rollback
27118 Query UPDATE user SET nickname='' WHERE usercode='117328'
27118 Query UPDATE user SET nickname='' WHERE usercode='117328'
27119 Connect usercenter@172.16.10.14 on usercenter
27119 Query SET NAMES utf8
27119 Query set autocommit=0
27119 Query UPDATE user SET nickname='' WHERE usercode='117328'
查询mysql的processlist发现,被阻塞的update语句所在的线程ID正好是第三条UPDATE语句
| 27114 | usercenter | 172.16.10.14:60404 | usercenter | Sleep | 35 | | NULL |
| 27115 | usercenter | 172.16.10.14:60405 | usercenter | Sleep | 35 | | NULL |
| 27116 | usercenter | 172.16.10.14:60406 | usercenter | Sleep | 35 | | NULL |
| 27117 | usercenter | 172.16.10.14:60407 | usercenter | Sleep | 35 | | NULL |
| 27118 | usercenter | 172.16.10.14:60408 | usercenter | Sleep | 35 | | NULL |
| 27119 | usercenter | 172.16.10.14:60409 | usercenter | Query | 35 | updating | UPDATE ih_user SET nickname='' WHERE usercode='117328'
很显然是27118这个连接做了UPDATE,没有COMMIT或者ROLLBACK,导致行锁没有释放,以至于27119的连接一直在等待锁。
这肯定是个BUG,仔细检查了应用程序里的代码块,没有发现持续占有锁没有释放的情况。
最后问题锁定到了程序里用到的数据库连接池DBUtils,直接上出问题的代码块:
def _get_tough_method(self, name):
"""Return a "tough" version of the given cursor method."""
def tough_method(*args, **kwargs):
execute = name.startswith('execute')
con = self._con
transaction = con._transaction
if not transaction:
con._ping_check(4)
try:
if con._maxusage:
if con._usage >= con._maxusage:
# the connection was used too often
raise con._failure
if execute:
self._setsizes()
method = getattr(self._cursor, name)
result = method(*args, **kwargs) # try to execute
if execute:
self._clearsizes()
except con._failures, error: # execution error
if not transaction:
try:
cursor2 = con._cursor(
*self._args, **self._kwargs) # open new cursor
except Exception:
pass
else:
try: # and try one more time to execute
if execute:
self._setsizes(cursor2)
method = getattr(cursor2, name)
result = method(*args, **kwargs)
if execute:
self._clearsizes()
except Exception:
pass
else:
self.close()
self._cursor = cursor2
con._usage += 1
return result
try:
cursor2.close()
except Exception:
pass
try: # try to reopen the connection
con2 = con._create()
except Exception:
pass
else:
try:
cursor2 = con2.cursor(
*self._args, **self._kwargs) # open new cursor
except Exception:
pass
else:
if transaction:
self.close()
con._close()
con._store(con2)
self._cursor = cursor2
raise error # raise the original error again
try: # try one more time to execute
if execute:
self._setsizes(cursor2)
method2 = getattr(cursor2, name)
result = method2(*args, **kwargs)
if execute:
self._clearsizes()
except error.__class__: # same execution error
use2 = False
except Exception, error: # other execution errors
use2 = True
else:
use2 = True
error = None
if use2:
self.close()
con._close()
con._store(con2)
self._cursor = cursor2
con._usage += 1
if error:
raise error # raise the other error
return result
try:
cursor2.close()
except Exception:
pass
try:
con2.close()
except Exception:
pass
if transaction:
self._transaction = False
raise error # reraise the original error again
else:
con._usage += 1
return result
return tough_method
在这段代码中,如果执行SQL脚本报错,如果异常属于OperationalError或者是InternalError,并且当前的transaction不为True的话,DBUtils会尝试重试一次,如果重试一次仍然报错,则会开启一个新的连接再试一次,最后才会抛出异常。
在DBUtils作者看来,OperationalError和InternalError出现的时候,在没有开启事务的情况下(self._transaction = False),显然是可以重试的。在本例中,DBUtils创建连接池的时候,向mysql服务器申请5个连接时却已经设置了autocommit为0,也就是说执行这个UPDATE语句需要显示的COMMIT或者ROLLBACK。这也导致DBUtils在尝试第三次UPDATE时出现了锁等待。
解决这个问题很简单,调用SteadyDBConnection.begin()方法显示的开启一个事务即可。
引申:
对于这个DBUtils的问题做了更深的思考,为什么DBUtils作者会把self.con._failures默认值设置为OperationalError和InternalError:
if self._failures is None:
try:
self._failures = (
self._dbapi.OperationalError,
self._dbapi.InternalError)
except AttributeError:
try:
self._failures = (
self._creator.OperationalError,
self._creator.InternalError)
except AttributeError:
try:
self._failures = (
con.OperationalError, con.InternalError)
except AttributeError:
raise AttributeError(
"Could not determine failure exceptions"
" (please set failures or creator.dbapi).")
可能认为这两类错误不会加锁,不需要提交或者回滚吧。
在PEP-0249中,对OperationalError定义如下:
Exception raised for errors that are related to the database's operation and not necessarily under the control of the programmer, e.g. an unexpected disconnect occurs, the data source name is not found, a transaction could not be processed, a memory allocation error occurred during processing, etc. It must be a subclass of DatabaseError.
绝大部分OperationalError抛出时,可能都不会持有锁,也就无所谓锁释放的问题。但是对于本例中返回的1366错误,UPDATE语句却对被更新的行加锁了,导致了锁等待的问题。PS:在构造DBUtils连接池时,是可以传failures参数来覆盖掉默认值的