今天在将爬虫数据写入MySQL的时候,发现了一个很蛋疼的问题
在管道中写操作MySQL的逻辑的时候,有一部分数据能写入数据库,但是另外一部分数据写不进去,经过检查 发现并不是数据格式的问题。
OperationalError (2006, 'MySQL server has gone away')
经过多方排查最终确定为由于爬虫的时候有些网页可能需要 很长时间才能打开,导致MySQL连接超时。
解决的办法:
在代码中加入:
self.db.ping(True)
仔细看这个方法:
def ping(self): # real signature unknown; restored from __doc__
"""
Checks whether or not the connection to the server is
working. If it has gone down, an automatic reconnection is
attempted.
This function can be used by clients that remain idle for a
long while, to check whether or not the server has closed the
connection and reconnect if necessary.
New in 1.2.2: Accepts an optional reconnect parameter. If True,
then the client will attempt reconnection. Note that this setting
is persistent. By default, this is on in MySQL<5.0.3, and off
thereafter.
Non-standard. You should assume that ping() performs an
implicit rollback; use only when starting a new transaction.
You have been warned.
"""
pass
说的而是检测MySQL的连接是否断开,如果断开就自动获取新的连接。
管道的完整代码如下:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import MySQLdb
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
class LetvShowPipeline(object):
def __init__(self):
self.db = MySQLdb.connect("IP地址", "用户名", "密码", "vboxDB", charset='utf8', port=端口)
self.cursor = self.db.cursor()
def process_item(self, item, spider):
# 提取字段
area = item['area']
host = item['host']
live_channel = item['live_channel']
name = item['name']
programs = item['programs']
selfproduced = item['selfproduced']
type = item['type']
number = item['number']
vv = item['vv']
year = item['year']
print "==============================================="
allhost = ''
for h in host:
allhost += h + '$'
print area, allhost, live_channel, name, programs, selfproduced, type, vv, year
# 写进数据库
self.db.ping(True)
# 先查询数据库中有没有数据
sql = "SELECT * FROM letv_show WHERE name='%s'" % name
self.cursor.execute(sql)
result = self.cursor.fetchall()
if len(result) != 0:
print 'updating data....'
# 如果数据库中已经存在该卡通,更新vv和number字段
sql = "UPDATE letv_show SET vv= '%s',number='%s' WHERE name='%s'" % (vv, number, name)
else:
print 'insert new data ....'
# 如果数据库中没有该卡通将数据写入MySQL
sql = "INSERT INTO letv_show(area, host, live_channel, name, programs, selfproduced, type, vv, year,number) \
VALUES ('%s','%s','%s','%s','%s','%s','%s','%s','%s','%s')" % \
(area, allhost, live_channel, name, programs, selfproduced, type, vv, year, number)
try:
# 执行sql语句
self.cursor.execute(sql)
# 提交到数据库执行
self.db.commit()
except:
# Rollback in case there is any error
self.db.rollback()
return item