An Item Pipeline component in Scrapy typically serves two purposes: detecting and dropping duplicate items, and saving the scraped data to a file or a database.
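As a rough illustration of those two roles, here is a minimal sketch rather than a definitive implementation. The class names `DuplicatesPipeline` and `JsonLinesPipeline`, the `title` key used for deduplication, and the output path `books.jl` are all assumptions made for this example. `DropItem` is Scrapy's real exception for discarding an item; the fallback class only exists so the sketch also runs where Scrapy is not installed.

```python
import json

try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so this sketch runs without Scrapy installed.
    class DropItem(Exception):
        pass


class DuplicatesPipeline:
    """Drop any item whose 'title' has been seen before.

    Using 'title' as the deduplication key is an assumption.
    """

    def __init__(self):
        self.seen_titles = set()

    def process_item(self, item, spider):
        title = item.get('title')
        if title in self.seen_titles:
            # Raising DropItem stops the item from reaching later pipelines.
            raise DropItem(f"Duplicate item: {title!r}")
        self.seen_titles.add(title)
        return item


class JsonLinesPipeline:
    """Write each item to a file as one JSON object per line."""

    def __init__(self, path='books.jl'):  # hypothetical output path
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable in the output.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item
```

In a real project, both classes would be enabled through the `ITEM_PIPELINES` setting, with the deduplication pipeline given the lower order number so that it runs before the writer.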
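The pipeline below processes book information scraped from Douban with Scrapy. Both data cleaning and deduplication can be done in the Item Pipeline.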
class DoubanBooksPipeline(object):
    def process_item(self, item, spider):
        # Author: trim the ends, then remove newlines and spaces.
        author = item['author']
        if author:
            item['author'] = author.strip().replace('\n','').replace(' ','')
        # Series: remove non-breaking spaces (\xa0).
        series = item['series']
        if series:
            item['series'] = series.replace('\xa0','')
        # Book description: remove newlines and spaces.
        content = item['content']
        if content:
            item['content'] = content.replace('\n','').replace(' ','')
        # Author biography: remove newlines and spaces.
        about_author = item['about_author']
        if about_author:
            item['about_author'] = about_author.replace('\n','').replace(' ','')
pub