This morning I woke up to find Scrapy on two or three nodes spewing errors, nearly tens of thousands of pages' worth. The error message was:
2019-07-12 21:48:44 [twisted] CRITICAL: Rollback failed
Traceback (most recent call last):
  File "/home/anaconda3/envs/python36/lib/python3.6/site-packages/twisted/python/threadpool.py", line 250, in inContext
    result = inContext.theWork()
  File "/home/anaconda3/envs/python36/lib/python3.6/site-packages/twisted/python/threadpool.py", line 266, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/home/anaconda3/envs/python36/lib/python3.6/site-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/home/anaconda3/envs/python36/lib/python3.6/site-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
--- <exception caught here> ---
  File "/home/anaconda3/envs/python36/lib/python3.6/site-packages/twisted/enterprise/adbapi.py", line 474, in _runInteraction
    conn.rollback()
  File "/home/anaconda3/envs/python36/lib/python3.6/site-packages/twisted/enterprise/adbapi.py", line 52, in rollback
    self._connection.rollback()
  File "/home/anaconda3/envs/python36/lib/python3.6/site-packages/pymysql/connections.py", line 431, in rollback
    self._execute_command(COMMAND.COM_QUERY, "ROLLBACK")
  File "/home/anaconda3/envs/python36/lib/python3.6/site-packages/pymysql/connections.py", line 745, in _execute_command
    raise err.InterfaceError("(0, '')")
pymysql.err.InterfaceError: (0, '')
Skimming the logs, I noticed a pattern: the entries right before the error were almost all listing-page crawls, with no items returned and no item-insertion log lines. So my suspicion is this: a connection in the adbapi pool went unused long enough that MySQL closed it on the server side; the next insert then failed, the transaction tried to roll back, and because the pooled connection itself was already dead, the rollback failed too, producing the traceback above.
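For context, MySQL drops connections that stay idle longer than its `wait_timeout` system variable (28800 seconds, i.e. 8 hours, by default). A minimal pure-Python sketch of that staleness condition, assuming we track each connection's last-use time ourselves (the function name is hypothetical, not part of any library):

```python
import time

# Default MySQL wait_timeout is 28800 seconds (8 hours); the server
# silently closes connections that stay idle longer than this.
WAIT_TIMEOUT = 28800

def is_probably_stale(last_used_at, now=None, wait_timeout=WAIT_TIMEOUT):
    """Return True if the server has likely already closed the connection."""
    now = time.time() if now is None else now
    return (now - last_used_at) > wait_timeout

# A connection last used 9 hours ago is past the default timeout:
print(is_probably_stale(last_used_at=0, now=9 * 3600))  # True
```

A crawler that spends hours on listing pages before producing its first item fits this condition exactly, which matches the pattern in the logs.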
The database access in my item pipeline looked like this:
# called by Scrapy for every item
def process_item(self, item, spider):
    query = self.dbpool.runInteraction(self._conditional_insert, item)
    query.addErrback(self._handle_error, item, spider)
    return item
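For readers unfamiliar with Twisted: `runInteraction` runs the callable in a thread pool with a cursor-like transaction object and returns a Deferred; when the callable raises, the errback fires with a Failure plus any extra arguments passed to `addErrback`. A toy stand-in for that control flow (plain Python, no Twisted, all names hypothetical) just to show how the failure reaches `_handle_error`:

```python
def run_interaction(interaction, *args):
    """Toy stand-in for dbpool.runInteraction: call the function
    synchronously and report either the result or the captured error."""
    try:
        return ("callback", interaction(*args))
    except Exception as exc:
        return ("errback", exc)

def failing_insert(item):
    # Mimic the dead-connection error from the traceback above.
    raise RuntimeError("(0, '')")

kind, value = run_interaction(failing_insert, {"doc_type": "news"})
print(kind)  # errback
```

The real Deferred delivers the error asynchronously, but the routing is the same: the exception raised inside the interaction ends up in the error handler rather than crashing the spider.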
My fix was to ping the connection before every insert; if the ping fails, I re-initialize the whole connection pool. The full code:
class MyPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbpool = MysqlConnectionPool().dbpool()
        return cls(dbpool)

    # called by Scrapy for every item
    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        query.addErrback(self._handle_error, item, spider)
        return item

    def _handle_error(self, failure, item, spider):
        print(failure)

    def _conditional_insert(self, transaction, item):
        # reach down to the underlying pymysql connection and ping it;
        # if that fails, rebuild the whole connection pool
        conn = transaction._connection._connection
        try:
            conn.ping()
        except Exception:
            self.dbpool.close()
            self.dbpool = MysqlConnectionPool().dbpool()
        sql = """INSERT INTO `DOC_BASEINFO` (doc_type, author_org)
                 VALUES (%s, %s)"""
        params = (item['doc_type'], item['author_org'])
        transaction.execute(sql, params)
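Two lighter-weight alternatives are worth noting (both are standard APIs, not part of my original fix): pymysql's `Connection.ping(reconnect=True)` re-opens a dead connection in place, and Twisted's `adbapi.ConnectionPool` accepts `cp_reconnect=True`, which makes the pool itself discard and replace broken connections. A hedged sketch of the ping-based variant, using a stub connection class so it runs without a MySQL server (`StubConnection` and `ensure_alive` are illustrative names, not library APIs):

```python
class StubConnection:
    """Stand-in for pymysql.connections.Connection, for illustration only."""
    def __init__(self):
        self.alive = False
        self.ping_calls = []

    def ping(self, reconnect=True):
        # pymysql's real ping() re-establishes the socket when
        # reconnect=True and the server has dropped the connection;
        # with reconnect=False it raises instead.
        self.ping_calls.append(reconnect)
        if reconnect:
            self.alive = True
        elif not self.alive:
            raise RuntimeError("connection is dead")

def ensure_alive(conn):
    """Revive the connection in place instead of rebuilding the whole pool."""
    conn.ping(reconnect=True)
    return conn

conn = StubConnection()
ensure_alive(conn)
print(conn.alive)  # True
```

Reviving a single connection is cheaper than closing and recreating the entire pool, and `cp_reconnect=True` at pool-construction time would avoid touching private attributes like `transaction._connection._connection` at all.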