A Scrapy project often defines multiple item types for different targets, for example an AItem for scraped page content and a BItem for downloading images or files. How do you handle several item types in the pipelines?
The principle is simple: in each pipeline, check the item's type. If it is an AItem, process it with APipeline; if it is a BItem, process it with BPipeline.
Note: do not drop items a pipeline cannot handle; always return them so the remaining pipelines still get a chance to process them.
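The dispatch pattern can be sketched without Scrapy at all. Here `TextItem`, `FileItem`, and the two pipeline classes are hypothetical stand-ins for your own items and pipelines:

```python
class TextItem(dict):
    """Stand-in for a scrapy Item holding page text."""

class FileItem(dict):
    """Stand-in for a scrapy Item holding file URLs."""

class TextPipeline:
    def process_item(self, item, spider):
        if isinstance(item, TextItem):   # only act on text items
            item['handled_by'] = 'TextPipeline'
        return item                      # never drop items of other types

class FilePipeline:
    def process_item(self, item, spider):
        if isinstance(item, FileItem):   # only act on file items
            item['handled_by'] = 'FilePipeline'
        return item

# Scrapy passes every item through every enabled pipeline in priority order;
# run() imitates that chain for the sketch.
pipelines = [TextPipeline(), FilePipeline()]

def run(item):
    for p in pipelines:
        item = p.process_item(item, spider=None)
    return item
```

Because each `process_item` returns items it does not recognize, a `TextItem` passes through `FilePipeline` untouched and vice versa.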
The following example scrapes pages and downloads images at the same time.
1. First, set ITEM_PIPELINES in settings.py:
ITEM_PIPELINES = {
    # page scraping
    'douban.pipelines.doubanPipeline': 10,
    # file download
    'douban.pipelines.doubanFilePipeline': 100,
}
# directory for downloaded files
FILES_STORE = 'D:\\1'
# files fetched within the last 90 days are not re-downloaded
FILES_EXPIRES = 90
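The integer values are priorities: lower numbers run first, so doubanPipeline (10) sees each item before doubanFilePipeline (100). A quick sketch of how that ordering falls out of the numbers:

```python
ITEM_PIPELINES = {
    'douban.pipelines.doubanPipeline': 10,
    'douban.pipelines.doubanFilePipeline': 100,
}
# Scrapy runs enabled pipelines sorted by their value, ascending
execution_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
```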
2. Define the pipelines in pipelines.py:
import scrapy
from scrapy.pipelines.files import FilesPipeline
from scrapy.exceptions import DropItem
from douban.items import doubanTextItem, doubanItem


class doubanPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, doubanTextItem):  # only handle doubanTextItem
            name = item['title'] + '.txt'
            with open(name, 'a', encoding='utf-8') as f:
                text = "".join(item['text'])
                f.write(text)
        return item  # always return the item so later pipelines receive it
class doubanFilePipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None):
        image_guid = request.url.split('/')[-1]
        file_name = image_guid.split('.')[0] + '.jpg'
        name = request.meta['name']
        if len(name):
            file_name = name + '/' + file_name
        return 'full/%s' % file_name

    def get_media_requests(self, item, info):
        if isinstance(item, doubanItem):  # only download files for doubanItem
            for image_url in item['file_urls']:
                if 'http' in image_url:
                    name = item['name']
                    yield scrapy.Request(url=image_url, meta={'name': name})

    def item_completed(self, results, item, info):
        if isinstance(item, doubanItem):
            image_paths = [x['path'] for ok, x in results if ok]
            if not image_paths:
                raise DropItem("Item contains no images")
            # item['image_paths'] = image_paths
        return item
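The path logic inside file_path is plain string manipulation, so it can be checked in isolation. `build_file_path` below is a hypothetical standalone extraction of that logic:

```python
def build_file_path(url, name):
    # take the last URL segment and replace its extension with .jpg
    image_guid = url.split('/')[-1]
    file_name = image_guid.split('.')[0] + '.jpg'
    if len(name):  # group files into a per-item folder when a name is set
        file_name = name + '/' + file_name
    return 'full/%s' % file_name
```

For example, `build_file_path('http://img.example.com/p/abc.png', 'movie')` yields `'full/movie/abc.jpg'`, i.e. the file lands under FILES_STORE in `full/movie/`.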
3. Define both item types and yield them from the spider.
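A sketch of how a parse callback might yield both item types. In a real project doubanTextItem and doubanItem would subclass scrapy.Item with scrapy.Field() attributes; plain dict subclasses and a dict in place of a Response keep this sketch dependency-free:

```python
class doubanTextItem(dict):
    """Stand-in for the scrapy Item carrying page text."""

class doubanItem(dict):
    """Stand-in for the scrapy Item carrying file URLs."""

def parse(response):
    """Hypothetical parse callback yielding one item of each type."""
    # page content goes to doubanPipeline
    yield doubanTextItem(title=response['title'], text=response['paragraphs'])
    # image URLs go to doubanFilePipeline
    yield doubanItem(name=response['title'], file_urls=response['image_urls'])

fake_response = {
    'title': 'some-movie',
    'paragraphs': ['first paragraph', 'second paragraph'],
    'image_urls': ['http://img.example.com/poster.jpg'],
}
items = list(parse(fake_response))
```

Both items then flow through every enabled pipeline; the isinstance checks shown in step 2 make sure each pipeline only acts on its own type.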