The scraping code itself is unremarkable; the part worth discussing is how the scraped data gets inserted into the database.
Commit to the database once every 1,000 rows:
pipelines.py
def process_item(self, item, spider):
    try:
        page_data = (item["scode"], item["name"], item["gender"], item["age"],
                     item["education"], item["position"], item["in_office_time"],
                     item["introduction"], item["insert_time"],
                     item["hold_count"], item["order_num"])
        self.item_list.append(page_data)
        if len(self.item_list) >= 1000:  # flush every 1,000 rows
            self.cursor.executemany(
                "INSERT INTO ssb_insight_company_team_manager_info"
                "(scode,name,gender,age,education,position,in_office_time,"
                "introduction,insert_time,hold_count,order_num) "
                "VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s) "
                "ON DUPLICATE KEY UPDATE name=VALUES(name),gender=VALUES(gender),"
                "age=VALUES(age),education=VALUES(education),"
                "position=VALUES(position),in_office_time=VALUES(in_office_time),"
                "introduction=VALUES(introduction),hold_count=VALUES(hold_count),"
                "order_num=VALUES(order_num)",
                self.item_list,
            )
            # commit the batch
            self.connect.commit()
            del self.item_list[:]
    except Exception as e:
        # log the error and roll back on failure
        spider.logger.error("database insert failed: %s", e)
        self.connect.rollback()
    return item
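The buffer-then-flush logic above (accumulate rows, flush every N, flush the remainder when the spider closes) is generic and easy to get subtly wrong, so it can be factored into a small helper that is testable without a database. This is a sketch, not part of the original pipeline; `BatchBuffer` and `flush_fn` are names introduced here for illustration:

```python
class BatchBuffer:
    """Accumulates rows and hands them to flush_fn in fixed-size batches."""

    def __init__(self, flush_fn, batch_size=1000):
        # flush_fn would be e.g. lambda rows: cursor.executemany(sql, rows)
        self.flush_fn = flush_fn
        self.batch_size = batch_size
        self.rows = []

    def add(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.rows:  # skip empty flushes
            self.flush_fn(self.rows)
            del self.rows[:]


# Demonstrate with an in-memory sink instead of a real cursor:
batches = []
buf = BatchBuffer(lambda rows: batches.append(len(rows)), batch_size=1000)
for i in range(2500):
    buf.add((i,))
buf.flush()  # final flush, mirroring close_spider
# batches is now [1000, 1000, 500]
```

In the pipeline, `process_item` would call `buf.add(page_data)` and `close_spider` would call `buf.flush()`, so the "last partial batch" case is handled in one place.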
Note: this alone still leaves a problem. If the final batch holds fewer than 1,000 rows, it never reaches the database and those rows are lost. So when the spider closes, the remaining rows must be flushed one last time:
def close_spider(self, spider):
    try:
        if self.item_list:  # flush the leftover rows (fewer than 1,000)
            self.cursor.executemany(
                "INSERT INTO ssb_insight_company_team_manager_info"
                "(scode,name,gender,age,education,position,in_office_time,"
                "introduction,insert_time,hold_count,order_num) "
                "VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s) "
                "ON DUPLICATE KEY UPDATE name=VALUES(name),gender=VALUES(gender),"
                "age=VALUES(age),education=VALUES(education),"
                "position=VALUES(position),in_office_time=VALUES(in_office_time),"
                "introduction=VALUES(introduction),hold_count=VALUES(hold_count),"
                "order_num=VALUES(order_num)",
                self.item_list,
            )
            # commit the final batch
            self.connect.commit()
    except Exception as e:
        # log the error and roll back on failure
        spider.logger.error("database insert failed: %s", e)
        self.connect.rollback()
    finally:
        self.cursor.close()
        self.connect.close()
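The long upsert statement appears verbatim in both methods, so adding or renaming a column means editing two strings and keeping the placeholder count in sync by hand. One option is to generate the SQL from a single column list. This is a sketch with names invented here (`build_upsert_sql`, `COLUMNS`, `KEY_COLUMNS`); the key columns are guessed from the fact that the original update clause skips scode and insert_time:

```python
# Columns in the order the item fields are appended to page_data.
COLUMNS = ["scode", "name", "gender", "age", "education", "position",
           "in_office_time", "introduction", "insert_time",
           "hold_count", "order_num"]

# Columns assumed to identify a row, so the upsert leaves them untouched.
KEY_COLUMNS = {"scode", "insert_time"}

def build_upsert_sql(table, columns, key_columns):
    """Builds INSERT ... ON DUPLICATE KEY UPDATE with one %s per column."""
    placeholders = ",".join(["%s"] * len(columns))
    updates = ",".join(
        f"{c}=VALUES({c})" for c in columns if c not in key_columns
    )
    return (f"INSERT INTO {table}({','.join(columns)}) "
            f"VALUES ({placeholders}) "
            f"ON DUPLICATE KEY UPDATE {updates}")

sql = build_upsert_sql("ssb_insight_company_team_manager_info",
                       COLUMNS, KEY_COLUMNS)
# sql has 11 placeholders and updates the 9 non-key columns,
# matching the hand-written statement above.
```

Both `process_item` and `close_spider` can then pass the same `sql` to `executemany`, and a column change touches only `COLUMNS`.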
Full code:
Link: https://pan.baidu.com/s/1rH4T-EgUoDSTLBLSjBKCtQ
Extraction code: fje3