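Without further ado, let's start with the Scrapy item pipeline.
Until now I have been storing scraped data in a database and in JSON, and converting it to Excel format whenever I needed it, which is a hassle. So today I looked around and found the openpyxl library; here is a quick note on it.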
from openpyxl import Workbook
# Create a workbook; one sheet is created along with it
wb = Workbook()
# Get that sheet (it is renamed to test1 below)
ws = wb.active
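(Note: active returns a single Worksheet object, not a list; it is a property that indexes into the workbook's internal list of sheets, as the openpyxl source shows:)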
@property
def active(self):
    """Get the currently active sheet or None

    :type: :class:`openpyxl.worksheet.worksheet.Worksheet`
    """
    try:
        return self._sheets[self._active_sheet_index]
    except IndexError:
        pass
ws.title = 'test1'
# Insert a row of data (the list holds the cell values)
ws.append([...])
# Save the workbook as test1.xlsx in the current directory
wb.save('test1.xlsx')
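The steps above can be put together into a small script that can be run on its own. This is only a sketch: the sample rows and the temporary output directory are made up for illustration, and it assumes an openpyxl version that supports `iter_rows(values_only=True)`.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

# Hypothetical sample rows; the header names are made up for illustration.
rows = [
    ['date', 'title'],
    ['2020-01-01', 'first post'],
    ['2020-01-02', 'second post'],
]

wb = Workbook()      # a new workbook starts with one sheet
ws = wb.active       # a single Worksheet, not a list
ws.title = 'test1'
for row in rows:
    ws.append(row)   # each call adds one row below the last used row

# Save into a temporary directory so the example leaves the current one untouched.
path = os.path.join(tempfile.mkdtemp(), 'test1.xlsx')
wb.save(path)

# Read the file back to confirm what was written.
loaded = load_workbook(path)
read_back = [list(r) for r in loaded['test1'].iter_rows(values_only=True)]
```

Reading the file back with `load_workbook` is a cheap way to check the output without opening Excel.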
Project code
import openpyxl


class CdcspiderExcelPipeline(object):
    '''
    save the items to an Excel file with openpyxl
    '''

    def __init__(self):
        '''
        initialize the object
        '''
        self.spider = None
        self.count = 0

    def log(self, l):
        '''
        write a message through the spider's logger
        '''
        msg = '========== CdcspiderExcelPipeline == %s' % l
        if self.spider is not None:
            # spider.logger -> return logging.LoggerAdapter(logger, {'spider': self})
            self.spider.logger.info(msg)

    def open_spider(self, spider):
        '''
        create the workbook and write the header row
        '''
        self.spider = spider  # keep a reference so that log() has a logger to use
        self.wb = openpyxl.Workbook()
        self.ws = self.wb.active
        # header row: article date, article title, url, article author
        self.ws.append(['文章日期', '文章标题', 'url', '文章作者'])

    def process_item(self, item, spider):
        '''
        append one item as a row
        '''
        self.count += 1
        self.log('process %s, %s:' % (spider.name, self.count))
        line = [item['article_time'], item['title'], item['url'], item['author']]
        self.ws.append(line)
        return item

    def close_spider(self, spider):
        '''
        save the rows to an Excel file
        '''
        print('ExcelPipeline info: items size: %s' % self.count)
        # _generate_filename is a helper from the project that is not shown in this post
        file_name = _generate_filename(spider, file_format='xlsx')
        self.wb.save(file_name)
The resulting Excel file looks like this
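In addition, Scrapy provides item exporters for persisting or exporting items, but I personally find them less convenient than a third-party library. Of course, that may just be down to my own limited skill, haha.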